Spark newbie here. I have a fairly large table in Hive (~130M records, 180 columns) and I'm trying to use Spark to write it out as a Parquet file. I'm using the default EMR cluster configuration with 6 r3.xlarge instances to submit my Spark application, which is written in Python. I run it on YARN in cluster mode, usually giving a small amount of memory (a couple of GB) to the driver and the rest to the executors. Here's my code:
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="ParquetTest")
hiveCtx = HiveContext(sc)

# Read the whole Hive table, then write it back out to S3 as Parquet
data = hiveCtx.sql("select * from my_table")
data.repartition(20).write.mode('overwrite').parquet("s3://path/to/myfile.parquet")
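For what it's worth, I haven't checked how many partitions the initial Hive scan actually produces; I assume I could inspect that with something like the snippet below (using the same data DataFrame as above), but I'm not sure whether it would explain the behaviour I describe further down:

# Rough check, reusing the `data` DataFrame defined above:
# how many partitions does the Hive scan produce before repartition(20)?
print("partitions read from Hive: %d" % data.rdd.getNumPartitions())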
I then submit it with something like this:
spark-submit --master yarn --deploy-mode cluster --num-executors 5 --driver-memory 4g --driver-cores 1 --executor-memory 24g --executor-cores 2 --py-files test_pyspark.py test_pyspark.py
However, the job takes forever to complete. Spark shuts down all but one executor very quickly after the job starts, since the others aren't being used, and it takes a few hours to read all the data from Hive. The Hive table itself is not partitioned or clustered yet (I'd also appreciate some advice on that).
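To clarify what I mean on the partitioning side: I imagine the Parquet output could be partitioned with something like the sketch below (event_date is just a made-up placeholder column, my real table has 180 columns), but the source Hive table itself currently has no partitioning or bucketing at all:

# Hypothetical sketch only -- 'event_date' is a placeholder column name,
# reusing the `data` DataFrame from the snippet above.
data.write.mode('overwrite') \
    .partitionBy('event_date') \
    .parquet("s3://path/to/myfile.parquet")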
Could you help me understand what I'm doing wrong, where I should go from here, and how to get the maximum performance out of the resources I have?
Thank you!