I have a PySpark job that writes about 1.5 TB of data to HDFS in Parquet format.
Here are the Spark params:
Number of executors: 500
Driver memory: 16G
Driver cores: 4
Executor memory: 16G
Executor cores: 4
spark.executor.memoryOverhead=16384
spark.sql.shuffle.partitions=2000
spark.default.parallelism=2000
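Roughly, these map to the following SparkSession configuration (a sketch only: the app name is a placeholder, and the resource-level options such as driver/executor memory and cores are normally fixed at spark-submit time rather than in code):

from pyspark.sql import SparkSession

# Sketch only: resource options (memory, cores, instances) normally have to
# be set at spark-submit time; the shuffle/parallelism options can also be
# changed at runtime via spark.conf.set(...).
spark = (
    SparkSession.builder
    .appName("parquet_write_test")                 # placeholder name
    .config("spark.executor.instances", "500")
    .config("spark.executor.memory", "16g")
    .config("spark.executor.cores", "4")
    .config("spark.executor.memoryOverhead", "16384")
    .config("spark.driver.memory", "16g")
    .config("spark.driver.cores", "4")
    .config("spark.sql.shuffle.partitions", "2000")
    .config("spark.default.parallelism", "2000")
    .enableHiveSupport()   # needed for saveAsTable/insertInto against a metastore
    .getOrCreate()
)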
How I write to HDFS (pseudo-code):
if the table already exists:
    df.write \
        .mode('append') \
        .format("parquet") \
        .insertInto("{}.{}".format(database, table))
else:
    df.write \
        .mode('overwrite') \
        .format("parquet") \
        .partitionBy(partition_columns) \
        .saveAsTable("{}.{}".format(database, table))
With the same parameters above, I ran two tests. The only difference between the two tests was the number of partitions passed to repartition() before writing to HDFS.
Test 1: repartition final data to 500 partitions: df.repartition(500)
It was much faster and smoother, and there were only small gaps between jobs:
https://drive.google.com/file/d/18PgbY_apbXWtZKfAloGj3Tg8C6bQJCSm/view?usp=sharing
https://drive.google.com/file/d/171TbeTb5wIkVjsMxBBIj4EnSStyPmww1/view?usp=sharing
You can see there are multiple jobs and the gaps between adjacent jobs are just a few minutes; the longest is about 8 minutes.
Test 2: repartition final data to 2000 partitions: df.repartition(2000)
https://drive.google.com/file/d/1PIRI8hDdgCitlIxxdEI4As6SSwrfzYnu/view?usp=sharing
(I took the screenshot before the application had finished, so there are fewer jobs than in test 1's screenshots.)
The time gap between the 1st and 2nd jobs is 51 minutes, and a gap of this size appears between every pair of adjacent jobs.
The entire Spark application took about 5 times longer than test 1: test 1 takes about 1 hour, test 2 about 5 hours.
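In both tests the repartition call sits right before the write shown above; a sketch of the test-1 variant (df_out is a name introduced here for illustration):

# Test 1: shrink the final stage to 500 partitions, i.e. 500 write tasks
# (test 2 is identical except for repartition(2000)).
df_out = df.repartition(500)
# df_out then goes through the append/overwrite write logic shown earlier.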
My questions:
- Why does writing more files cause such a big difference in performance? Shouldn't multiple Parquet files for the same dataset/DataFrame be written concurrently?
- I've noticed that in test 2, Spark was still writing Parquet files to HDFS during the gap time (after the job was shown as completed in the UI). During that time the number of Parquet files kept increasing, and it only stopped once there were 2000 Parquet files (a file-count sketch is included after this list). It seems Spark does not write to HDFS concurrently, even though I have 2000 executor cores (500 executors * 4 cores) and I set spark.sql.shuffle.partitions=2000 and spark.default.parallelism=2000. Is it possible to write to HDFS concurrently?
- It seems that when the Spark UI shows a job as completed, it only means the job is completed in memory, while the write to HDFS can still be running, which is the gap. Is that true? See also https://weidongzhou.wordpress.com/2016/12/03/mysterious-time-gap-in-spark-job-timeline/
- Is it possible that the underlying Hadoop service is throttling the writes from Spark?
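For illustration, one way to watch the Parquet file count grow during the gap (a sketch: table_path is a placeholder for the table's HDFS location, and the _jvm/_jsc handles are internal PySpark accessors, not public API):

def count_parquet_files(spark, table_path):
    """Recursively count *.parquet files under an HDFS path via the
    Hadoop FileSystem API exposed through py4j."""
    jvm = spark._jvm
    hadoop_conf = spark._jsc.hadoopConfiguration()
    path = jvm.org.apache.hadoop.fs.Path(table_path)
    fs = path.getFileSystem(hadoop_conf)
    files = fs.listFiles(path, True)   # True = recursive
    count = 0
    while files.hasNext():
        status = files.next()
        if status.getPath().getName().endswith(".parquet"):
            count += 1
    return count

# Example: poll while the "gap" is happening (placeholder path).
# print(count_parquet_files(spark, "hdfs:///user/hive/warehouse/mydb.db/mytable"))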