
I am trying to read a Hive table in Spark with hiveContext. The job reads data from two tables into two DataFrames, which are then converted to RDDs and joined on a common key. However, this join is failing with a MetadataFetchFailedException (see: What are the likely causes of org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle?).

I want to avoid that by spreading my data across more nodes. Currently, even though I have 800 executors, most of the data is read into only 10 nodes, each of which is using more than 50% of its memory.

The question is: how do I spread the data over more partitions during the read operation? I do not want to repartition later on.

 val tableRDD = hiveContext.read.table("tableName")
                          .select("colId1", "colId2")
                          .rdd
                          // extract the two columns from each Row; a flatMap returning a
                          // single-element Array is just a map of Row => (key, value)
                          .map(sqlRow => (sqlRow.getAs[String]("colId1"), sqlRow.getAs[String]("colId2")))
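For context, this is the direction I have been considering: shrinking the Hadoop input split size so the scan itself produces more partitions, and giving the RDD join an explicit partition count. This is only a sketch under assumptions — the split-size property applies to splittable input formats, the 64 MB / 2000 numbers are placeholders, and rightRDD stands in for the second table read the same way:

```scala
// Sketch, assuming Spark 1.x with an existing hiveContext and a splittable input format.
// Smaller max split size => more input partitions at read time, no repartition() needed.
hiveContext.setConf("mapreduce.input.fileinputformat.split.maxsize",
                    (64 * 1024 * 1024).toString)   // ~64 MB per split (placeholder value)

val leftRDD = hiveContext.read.table("tableName")
                         .select("colId1", "colId2")
                         .rdd
                         .map(row => (row.getAs[String]("colId1"), row.getAs[String]("colId2")))

// rightRDD: the second table, read and keyed the same way (elided here).
// PairRDDFunctions.join also accepts an explicit partition count for the shuffle:
// val joined = leftRDD.join(rightRDD, numPartitions = 2000)
```

I have not verified whether the split-size property is honored for this table's storage format, which is part of what I am asking.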

0 Answers