So, I am trying to read a Hive table in Spark with hiveContext. The job reads data from two tables into two DataFrames, which are subsequently converted to RDDs. I then join them on a common key. However, this join fails with a MetadataFetchFailedException (see What are the likely causes of org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle?).
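For context, the pipeline is roughly the following; the table, column, and variable names here are placeholders, not my real schema:

// read both tables, project the join key plus a value column, and
// convert each to a pair RDD keyed on the common key
val left = hiveContext.read.table("tableA").select("key", "valA").rdd
  .map(row => (row.get(0), row.get(1)))
val right = hiveContext.read.table("tableB").select("key", "valB").rdd
  .map(row => (row.get(0), row.get(1)))
// this join triggers the shuffle that fails with the
// MetadataFetchFailedException
val joined = left.join(right)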
I want to avoid this failure by spreading my data over more nodes. Currently, even though I have 800 executors, most of the data is being read onto just 10 nodes, each of which is using more than 50% of its memory.
The question is: how do I spread the data over more partitions during the read operation itself? I do not want to repartition later on. My read currently looks like this:
val tableRDD = hiveContext.read.table("tableName")
  .select("colId1", "colId2")
  .rdd
  // pull the two selected columns out of each Row as a (key, value) pair
  .map(sqlRow => (sqlRow.get(0), sqlRow.get(1)))
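The only lead I have so far is shrinking the input split size so the table scan produces more partitions. A minimal sketch of what I have been trying, assuming the table is backed by plain HDFS files; the property is a standard Hadoop one, but whether this read path actually honors it is an assumption on my part:

// force smaller input splits (here 32 MB) so the Hive table scan
// creates more read partitions; mapreduce.input.fileinputformat.split.maxsize
// is a standard Hadoop setting, but I am not sure it applies here
sc.hadoopConfiguration.set(
  "mapreduce.input.fileinputformat.split.maxsize",
  (32 * 1024 * 1024).toString)

I have not been able to confirm that this changes the number of partitions the read produces, hence the question.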