You can persist the data in Parquet format after reading the JSON:
val hcData = sqlContext.read.option("inferSchema", "true").json(path)
hcData.write.parquet("hcDataFile.parquet")
val hcDataDF = spark.read.parquet("hcDataFile.parquet")
// create a temporary view in Spark 2.0 (or use registerTempTable in Spark 1.6) and use SQL for further logic
hcDataDF.createOrReplaceTempView("T_hcDataDF")
//This is a manual way of doing RDD-style checkpointing (checkpointing is not supported for DataFrames); it truncates the lineage, which improves performance.
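For reference, the built-in checkpoint API that does exist at the RDD level looks roughly like the sketch below; the checkpoint directory is illustrative, and sc is the active SparkContext (spark.sparkContext in Spark 2.0):
// A minimal sketch of native RDD checkpointing, shown only for comparison.
// The HDFS path is illustrative; point it at any reliable storage location.
sc.setCheckpointDir("hdfs:///tmp/checkpoints")
val hcRDD = hcData.rdd      // drop to the RDD level
hcRDD.checkpoint()          // mark the RDD for checkpointing
hcRDD.count()               // an action materializes the checkpoint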
For execution, use Dynamic Resource Allocation in your spark-submit command:
//Make sure the following are enabled in your cluster; otherwise you can pass these parameters to the spark-submit command as --conf (a programmatic alternative is sketched after the list)
• spark.dynamicAllocation.enabled=true
• spark.dynamicAllocation.initialExecutors=5
• spark.dynamicAllocation.minExecutors=5
• spark.shuffle.service.enabled=true
• yarn.nodemanager.aux-services=mapreduce_shuffle,spark_shuffle
• yarn.nodemanager.aux-services.spark_shuffle.class=org.apache.spark.network.yarn.YarnShuffleService
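If you prefer to set the Spark-side options from the application itself rather than via --conf, a minimal Spark 2.x sketch is shown below; the application name is illustrative, and the two yarn.nodemanager.* entries are cluster-side NodeManager settings that belong in yarn-site.xml, not in the application:
// A minimal sketch, assuming Spark 2.x and an assumed app name "hcData"
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hcData")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.initialExecutors", "5")
  .config("spark.dynamicAllocation.minExecutors", "5")
  .config("spark.shuffle.service.enabled", "true")
  .getOrCreate()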
//Spark-submit command
./bin/spark-submit --class package.hcDataclass \
--master yarn \
--deploy-mode cluster \
--driver-memory 1G \
--executor-memory 5G \
hcData*.jar
//With Dynamic Resource Allocation we don't need to specify the number of executors; the job will automatically acquire resources based on what is available in the cluster.