I have a Spark job that reads in a day's worth of data from location A and writes it out to location B. The point of the job is to concatenate many small files into a single file per Hive-style partition in S3. My code is extremely simple, but it runs very, very slowly.
Code
df = spark.read.parquet('s3://location/A/')   # read one day of small Parquet files

(df
 .coalesce(1)                        # collapse everything into a single partition
 .write
 .mode('overwrite')
 .partitionBy('date', 'user_id')     # Hive-style partitioning on the output
 .parquet('s3://location/B/'))
Spark Submit
spark-submit \
--master spark://foobar \
--deploy-mode cluster \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.shuffle.service.enabled=true \
--conf spark.dynamicAllocation.minExecutors=1 \
--conf spark.dynamicAllocation.maxExecutors=18 \
--conf spark.dynamicAllocation.initialExecutors=4 \
--conf spark.executor.memory=4G \
--conf spark.executor.cores=4 \
--conf spark.driver.memory=2G \
--conf spark.shuffle.io.preferDirectBufs=false \
--conf spark.executor.heartbeatInterval=10000000 \
--conf spark.network.timeout=10000000
What kind of configuration changes can I make to speed this up, or is coalesce(1) just always going to be very slow?
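In case it helps frame what I'm after, this is the kind of alternative I'm wondering about (an untested sketch, reusing the same paths as above): repartition on the partition columns instead of coalescing, so that the rows for each (date, user_id) land on a single task and should still come out as one file per output partition, without funneling all the data through one task.

# Untested sketch of the alternative I'm considering
df = spark.read.parquet('s3://location/A/')

(df
 .repartition('date', 'user_id')     # shuffle so each (date, user_id) lands on one task
 .write
 .mode('overwrite')
 .partitionBy('date', 'user_id')
 .parquet('s3://location/B/'))

Would that be the right direction, or is there a configuration-level fix for the coalesce(1) version?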