I'm using Spark Structured Streaming with a Kafka source and Avro-formatted messages, and the creation of the DataFrame is very slow!
In order to measure the streaming query I have to add an action that forces evaluation of the DAG, and then time it. With foreachBatch() I persist and count the DataFrame. Alternatively, I write to a file and read the time from the Spark UI. The drawback is that I can't isolate the transformation time, because the Kafka read, the transformation, and the write/persist all land in the same stage.
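As a concrete sketch of the foreachBatch() measurement described above (the function name and timing layout are illustrative, not taken from the original code):

```python
import time

def timed_batch(batch_df, batch_id):
    # Cache so the batch is materialized exactly once; the count()
    # below forces the whole DAG (Kafka read + Avro decode) to run.
    batch_df.persist()
    start = time.time()
    rows = batch_df.count()
    elapsed = time.time() - start
    print("batch %d: %d rows in %.3fs" % (batch_id, rows, elapsed))
    batch_df.unpersist()

# Attached to the streaming query (sketch):
# query = streaming_df.writeStream.foreachBatch(timed_batch).start()
```

As noted above, this measures the whole batch end to end; it cannot split the Kafka read time from the transformation time.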
NOTE: implemented in both Scala (2.11) and Python (2.7.3) with the same results.
The implementation is based on the Spark 2.4.3 Avro documentation (https://spark.apache.org/docs/latest/sql-data-sources-avro.html). The problem is that the .select('avro_struct.*') step is slow:
streaming_df = streaming_df \
    .select(self._from_avro('value', json.dumps(avro_schema_string))
            .alias('avro_struct')) \
    .select(F.col('avro_struct.*'))
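For context on self._from_avro: Spark 2.4.x ships from_avro only in the Scala/Java API (the Python wrapper pyspark.sql.avro.functions.from_avro appeared in Spark 3.0), so PySpark code on 2.4 typically bridges to the Scala function via py4j. A minimal sketch of such a wrapper (the function name is hypothetical; the py4j access path is the commonly used one for 2.4.x):

```python
def from_avro_column(column_name, avro_json_schema):
    """Bridge to the Scala from_avro for Spark 2.4.x, where no
    Python API exists. Returns a Column decoding Avro bytes."""
    from pyspark import SparkContext
    from pyspark.sql.column import Column, _to_java_column

    sc = SparkContext._active_spark_context
    # from_avro lives in the Scala package object org.apache.spark.sql.avro
    pkg = getattr(sc._jvm.org.apache.spark.sql.avro, "package$")
    scala_fns = getattr(pkg, "MODULE$")
    return Column(scala_fns.from_avro(_to_java_column(column_name),
                                      avro_json_schema))
```

This would be used in place of self._from_avro in the snippet above, e.g. from_avro_column('value', json.dumps(avro_schema_string)).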
Dependencies:
- Spark 2.4.3
- Kafka 2.11-2.10
- Scala 2.11.8