
I am running Hive queries using HiveContext from my Spark code. No matter which query I run and how much data it touches, it always produces 31 partitions. Does anybody know the reason? Is there a predefined/configurable setting for it? I essentially need more partitions.

I am using this code snippet to execute the Hive query:

    var pairedRDD = hqlContext.sql(hql).rdd.map(...)

I am using Spark 1.3.1.
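
For completeness, the contexts are created in the standard Spark 1.3 way (a minimal sketch; the app name and conf details are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("hive-partitions"))
    val hqlContext = new HiveContext(sc)  // HiveContext wraps the SparkContext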

Thanks, Nitin


1 Answer


The number of partitions in an RDD is the same as the number of partitions in the RDD on which it depends, with a couple of exceptions: the coalesce transformation creates an RDD with fewer partitions than its parent, union creates an RDD with the sum of its parents' partition counts, and cartesian creates an RDD with their product. A short illustration of these rules follows below.
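
A minimal sketch of the propagation rules, assuming a plain SparkContext named sc:

    val a = sc.parallelize(1 to 100, 4)  // 4 partitions
    val b = sc.parallelize(1 to 100, 3)  // 3 partitions
    a.map(_ * 2).partitions.length       // 4  -- narrow transformations keep the count
    a.coalesce(2).partitions.length      // 2  -- fewer partitions, no shuffle
    a.union(b).partitions.length         // 7  -- sum of the parents
    a.cartesian(b).partitions.length     // 12 -- product of the parents

To increase the number of partitions: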

  • Use the repartition transformation, which will trigger a shuffle (see the sketch after this list).
  • Configure your InputFormat to create more splits.
  • Write the input data out to HDFS with a smaller block size.
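
For the first option, here is a minimal sketch against the snippet in the question (hql and hqlContext are the question's own names; the target of 128 is arbitrary):

    val rdd = hqlContext.sql(hql).rdd
    println(rdd.partitions.length)    // 31 in your case
    val wider = rdd.repartition(128)  // full shuffle; rows are redistributed evenly
    println(wider.partitions.length)  // 128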

This link has a good explanation of how the number of partitions is determined and how to increase it.
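
For the second option, one commonly suggested approach is to lower the maximum split size on the Hadoop configuration before running the query. This is only a sketch, under the assumption that the table is stored in a splittable, file-based format whose InputFormat honors these properties (plain text and sequence files do; gzipped files do not):

    // Smaller max split size => more input splits => more partitions
    sc.hadoopConfiguration.setLong("mapred.max.split.size", 32 * 1024 * 1024)
    sc.hadoopConfiguration.setLong("mapreduce.input.fileinputformat.split.maxsize", 32 * 1024 * 1024)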
