
What are the potential disadvantages (if any) of streaming micro-batches from HDFS/S3-backed Parquet files, as opposed to a standard source like Kafka, for a long-running Spark Structured Streaming job?

John Subas
    One difference I have come across is in the way Spark discovers new data for Parquet vs Kafka. In the case of Parquet, it needs to get the list of files already processed from the checkpoint directory, compare it with the list of files in the source, and finally arrive at the list of new files to read. This process can be time-consuming if each trigger reads many files, since the list of files to compare keeps growing with every trigger. Also, the input rate control available with maxFilesPerTrigger seems less efficient than maxOffsetsPerTrigger. – John Subas Oct 28 '22 at 19:03
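    The list-and-compare step the comment describes can be sketched in plain Python. This is an illustrative simulation, not Spark's actual `FileStreamSource` code: the function and variable names here are hypothetical, and the real implementation tracks seen files in the checkpoint's file-source log. It shows why the comparison works against an ever-growing history on a long-running job.

    ```python
    # Hypothetical sketch of the file-source discovery step: on every
    # trigger, the source directory is listed and the set of files already
    # recorded in the checkpoint is subtracted. Names are illustrative,
    # not Spark APIs.

    def discover_new_files(source_listing, seen_files):
        """Return files not yet processed: list the source, subtract history."""
        return [f for f in source_listing if f not in seen_files]

    # Simulate a long-running job: the seen-file set only grows, so each
    # trigger's listing and membership checks cover more and more files,
    # even though only a few files per trigger are actually new.
    seen = set()
    for trigger in range(3):
        # Each trigger, two new files land alongside everything already there.
        listing = [f"part-{i}.parquet" for i in range(2 * (trigger + 1))]
        new_files = discover_new_files(listing, seen)
        seen.update(new_files)
        print(f"trigger {trigger}: compared {len(listing)} files, "
              f"{len(new_files)} new")
    ```

    In contrast, the Kafka source only has to track an offset per partition, so `maxOffsetsPerTrigger` can bound the input without any per-file bookkeeping, whereas `maxFilesPerTrigger` still pays for the full directory listing and comparison each trigger.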

0 Answers