
What are the potential disadvantages (if any) of streaming micro-batches from HDFS/S3-backed Parquet files, as opposed to a standard source like Kafka, for a long-running Spark Structured Streaming job?

John Subas
    One difference I have come across is in the way Spark discovers new data for Parquet vs Kafka. In the case of Parquet, it needs to get the list of files already processed from the checkpoint directory, compare it with the list of files in the source, and finally arrive at the list of new files to read. This process can be time-consuming if each trigger reads many files, since the list of files to compare keeps growing with every trigger. Also, the input rate control available with maxFilesPerTrigger seems less efficient than maxOffsetsPerTrigger. – John Subas Oct 28 '22 at 19:03
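    The list-and-compare step the comment describes can be sketched in plain Python. This is an illustrative simulation, not Spark's actual `FileStreamSource` code: the function and variable names here are hypothetical, and the real implementation tracks seen files in the checkpoint's file-source log. It shows why the comparison works against an ever-growing history on a long-running job.

    ```python
    # Hypothetical sketch of the file-source discovery step: on every
    # trigger, the source directory is listed and the set of files already
    # recorded in the checkpoint is subtracted. Names are illustrative,
    # not Spark APIs.

    def discover_new_files(source_listing, seen_files):
        """Return files not yet processed: list the source, subtract history."""
        return [f for f in source_listing if f not in seen_files]

    # Simulate a long-running job: the seen-file set only grows, so each
    # trigger's listing and membership checks cover more and more files,
    # even though only a few files per trigger are actually new.
    seen = set()
    for trigger in range(3):
        # Each trigger, two new files land alongside everything already there.
        listing = [f"part-{i}.parquet" for i in range(2 * (trigger + 1))]
        new_files = discover_new_files(listing, seen)
        seen.update(new_files)
        print(f"trigger {trigger}: compared {len(listing)} files, "
              f"{len(new_files)} new")
    ```

    In contrast, the Kafka source only has to track an offset per partition, so `maxOffsetsPerTrigger` can bound the input without any per-file bookkeeping, whereas `maxFilesPerTrigger` still pays for the full directory listing and comparison each trigger.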

0 Answers