So, I am trying to read a Hive table in Spark with hiveContext. The job reads data from two tables into two DataFrames, which are subsequently converted to RDDs. I then join them on a common key. However, this join fails with a MetadataFetchFailedException (see What are the likely causes of org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle?).
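For context, the pipeline is roughly the following; the table, column, and variable names here are placeholders, not my real schema:

// read both tables, project the join key plus a value column, and
// convert each to a pair RDD keyed on the common key
val left = hiveContext.read.table("tableA").select("key", "valA").rdd
  .map(row => (row.get(0), row.get(1)))
val right = hiveContext.read.table("tableB").select("key", "valB").rdd
  .map(row => (row.get(0), row.get(1)))
// this join triggers the shuffle that fails with the
// MetadataFetchFailedException
val joined = left.join(right)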
I want to avoid this failure by spreading my data over more nodes. Currently, even though I have 800 executors, most of the data is being read onto just 10 nodes, each of which is using more than 50% of its memory.
The question is: how do I spread the data over more partitions during the read operation itself? I do not want to repartition later on. My read currently looks like this:
val tableRDD = hiveContext.read.table("tableName")
  .select("colId1", "colId2")
  .rdd
  // pull the two selected columns out of each Row as a (key, value) pair
  .map(sqlRow => (sqlRow.get(0), sqlRow.get(1)))
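The only lead I have so far is shrinking the input split size so the table scan produces more partitions. A minimal sketch of what I have been trying, assuming the table is backed by plain HDFS files; the property is a standard Hadoop one, but whether this read path actually honors it is an assumption on my part:

// force smaller input splits (here 32 MB) so the Hive table scan
// creates more read partitions; mapreduce.input.fileinputformat.split.maxsize
// is a standard Hadoop setting, but I am not sure it applies here
sc.hadoopConfiguration.set(
  "mapreduce.input.fileinputformat.split.maxsize",
  (32 * 1024 * 1024).toString)

I have not been able to confirm that this changes the number of partitions the read produces, hence the question.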