Spark newbie here. I have a fairly large table in Hive (~130M records, 180 columns) and I'm trying to use Spark to write it out as a Parquet file. I'm using the default EMR cluster configuration with 6 r3.xlarge instances to submit my Spark application, which is written in Python. I run it on YARN in cluster mode, usually giving a small amount of memory (a couple of GB) to the driver and the rest to the executors. Here's my code:
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="ParquetTest")
hiveCtx = HiveContext(sc)

# Read the whole Hive table, then write it back out to S3 as Parquet
data = hiveCtx.sql("select * from my_table")
data.repartition(20).write.mode('overwrite').parquet("s3://path/to/myfile.parquet")
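For what it's worth, I haven't checked how many partitions the initial Hive scan actually produces; I assume I could inspect that with something like the snippet below (using the same data DataFrame as above), but I'm not sure whether it would explain the behaviour I describe further down:

# Rough check, reusing the `data` DataFrame defined above:
# how many partitions does the Hive scan produce before repartition(20)?
print("partitions read from Hive: %d" % data.rdd.getNumPartitions())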
I then submit it with something like this:
spark-submit --master yarn --deploy-mode cluster --num-executors 5 --driver-memory 4g --driver-cores 1 --executor-memory 24g --executor-cores 2 --py-files test_pyspark.py test_pyspark.py
However, the job takes forever to complete. Spark shuts down all but one executor very quickly after the job starts, since the others aren't being used, and it takes a few hours to read all the data from Hive. The Hive table itself is not partitioned or clustered yet (I'd also appreciate some advice on that).
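To clarify what I mean on the partitioning side: I imagine the Parquet output could be partitioned with something like the sketch below (event_date is just a made-up placeholder column, my real table has 180 columns), but the source Hive table itself currently has no partitioning or bucketing at all:

# Hypothetical sketch only -- 'event_date' is a placeholder column name,
# reusing the `data` DataFrame from the snippet above.
data.write.mode('overwrite') \
    .partitionBy('event_date') \
    .parquet("s3://path/to/myfile.parquet")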
Could you help me understand what I'm doing wrong, where I should go from here, and how to get the maximum performance out of the resources I have?
Thank you!