If a Spark Streaming job involves shuffle and stateful processing, it can easily generate many small files per micro-batch. How can we reduce the number of files without hurting latency?
Viewed 1,044 times
With all default configs, one Spark Streaming micro-batch can generate 80,000 files. This causes high QPS and latency on HDFS. Changing the configs below reduces the number of checkpoint files.
| Config | Default | Suggested |
|---|---|---|
| spark.sql.streaming.minBatchesToRetain | 100 | 30 |
| spark.sql.streaming.stateStore.minDeltasForSnapshot | 10 | 5 |
| spark.sql.shuffle.partitions | 200 | Depends on micro-batch size, e.g. 50 or 100 |
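As a minimal sketch, the suggested values could be applied when building the session (the config keys are the real Spark SQL settings from the table; the app name and partition count here are illustrative assumptions):

```python
from pyspark.sql import SparkSession

# Hypothetical session setup; tune spark.sql.shuffle.partitions
# to your micro-batch size (e.g. 50 or 100).
spark = (
    SparkSession.builder
    .appName("streaming-small-files")  # illustrative name
    .config("spark.sql.streaming.minBatchesToRetain", "30")
    .config("spark.sql.streaming.stateStore.minDeltasForSnapshot", "5")
    .config("spark.sql.shuffle.partitions", "100")
    .getOrCreate()
)
```

These can equally be passed as `--conf` options to `spark-submit`.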
So, total number of files = minBatchesToRetain × 4 (2 files for the left side + 2 for the right side of a join) × shuffle partitions × number of stateful operators (each join or aggregation).
If all configs are default, that is 100 × 4 × 200 × 1 = 80,000.
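The formula above can be sketched as a small helper (the function name and the fixed factor of 4 files per partition are taken from the answer's estimate, not from Spark itself):

```python
def checkpoint_files(min_batches_to_retain, shuffle_partitions,
                     stateful_operators, files_per_partition=4):
    """Estimate checkpoint file count per the answer's formula.

    files_per_partition=4 assumes 2 files per join side
    (left 2 + right 2), as stated in the answer.
    """
    return (min_batches_to_retain * files_per_partition
            * shuffle_partitions * stateful_operators)

print(checkpoint_files(100, 200, 1))  # defaults -> 80000
print(checkpoint_files(30, 100, 1))   # suggested -> 12000
```

With the suggested settings the same single-join job drops from 80,000 to 12,000 files.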

Warren Zhu
Which Spark version do you use? It seems that in Spark 3, the number of partitions doesn't contribute to the total number of files. – Gatsby Lee Sep 28 '22 at 22:30