
I have a Spark job that reads a day's worth of data from location A and writes it out to location B. The point of the job is to concatenate many small files into a single file for each Hive-style partition in S3. My code is extremely simple, but it runs very slowly.

Code

df = spark.read.parquet('s3://location/A/')

(df
    .coalesce(1)
    .write
    .mode('overwrite')
    .partitionBy('date', 'user_id')
    .parquet('s3://location/B/'))

Spark Submit

spark-submit \
    --master spark://foobar \
    --deploy-mode cluster \
    --conf spark.dynamicAllocation.enabled=true \
    --conf spark.shuffle.service.enabled=true \
    --conf spark.dynamicAllocation.minExecutors=1 \
    --conf spark.dynamicAllocation.maxExecutors=18 \
    --conf spark.dynamicAllocation.initialExecutors=4 \
    --conf spark.executor.memory=4G \
    --conf spark.executor.cores=4 \
    --conf spark.driver.memory=2G \
    --conf spark.shuffle.io.preferDirectBufs=false \
    --conf spark.executor.heartbeatInterval=10000000 \
    --conf spark.network.timeout=10000000

What kind of configuration can I do to make it run faster, or is coalesce(1) just always going to be very slow?

moku
  • Have you looked at this post: https://stackoverflow.com/questions/31056476/spark-coalesce-very-slow-even-the-output-data-is-very-small?rq=1 – Explorer Apr 10 '18 at 20:42
  • @Explorer I don't think you can do shuffle=True on the DataFrame's coalesce: http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame – moku Apr 10 '18 at 21:10

1 Answer


The link posted by @Explorer could be helpful. Try repartition(1) on your DataFrame instead, since it is equivalent to coalesce(1, shuffle=True). Be aware that if your output is quite large, the job will still be slow due to the heavy network I/O of the shuffle.
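As a minimal sketch of that suggestion (reusing the paths and columns from the question, and assuming Spark 2.x PySpark; the session setup and app name are just illustrative), the only change is swapping coalesce(1) for repartition(1):

from pyspark.sql import SparkSession

# Illustrative session setup; in the original job `spark` already exists.
spark = SparkSession.builder.appName('compact-small-files').getOrCreate()

df = spark.read.parquet('s3://location/A/')

(df
    .repartition(1)  # full shuffle, i.e. what coalesce(1, shuffle=True) does on an RDD
    .write
    .mode('overwrite')
    .partitionBy('date', 'user_id')
    .parquet('s3://location/B/'))

If the real goal is one file per ('date', 'user_id') directory rather than one global partition, repartitioning by those same columns, e.g. df.repartition('date', 'user_id'), spreads the shuffle output across tasks while still yielding roughly one file per Hive-style partition, which may avoid funneling all the data through a single writer.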

wei tu