
I select everything from a table and create a DataFrame (df) out of it using PySpark. The table is partitioned by:

  partitionBy('date', 't', 's', 'p')

Now I want to get the number of partitions using:

  df.rdd.getNumPartitions()

but it returns a much larger number (15642 partitions) than expected (18 partitions).

Output of the show partitions command in Hive:

 date=2019-10-02/t=u/s=u/p=s
 date=2019-10-03/t=u/s=u/p=s
 date=2019-10-04/t=u/s=u/p=s
 date=2019-10-05/t=u/s=u/p=s
 date=2019-10-06/t=u/s=u/p=s
 date=2019-10-07/t=u/s=u/p=s
 date=2019-10-08/t=u/s=u/p=s
 date=2019-10-09/t=u/s=u/p=s
 date=2019-10-10/t=u/s=u/p=s
 date=2019-10-11/t=u/s=u/p=s
 date=2019-10-12/t=u/s=u/p=s
 date=2019-10-13/t=u/s=u/p=s
 date=2019-10-14/t=u/s=u/p=s
 date=2019-10-15/t=u/s=u/p=s
 date=2019-10-16/t=u/s=u/p=s
 date=2019-10-17/t=u/s=u/p=s
 date=2019-10-18/t=u/s=u/p=s
 date=2019-10-19/t=u/s=u/p=s

Any idea why the number of partitions is so huge, and how can I get the expected number of partitions (18)?
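
For reference, here is a minimal sketch of the setup (the session setup and the table name my_table are placeholders, not the real names):

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.enableHiveSupport().getOrCreate()

  # Select everything from the Hive table into a DataFrame.
  df = spark.sql("SELECT * FROM my_table")

  # This reports Spark's internal RDD partition count (15642 here),
  # not the number of Hive partitions.
  print(df.rdd.getNumPartitions())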

Alan

2 Answers

spark.sql("show partitions hivetablename").count()

The number of partitions in rdd is different from the hive partitions. Spark generally partitions your rdd based on the number of executors in cluster so that each executor gets fair share of the task. You can control the rdd partitions by using sc.parallelize(, )) , df.repartition() or coalesce().
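
A minimal sketch (the table name my_table is a placeholder) of changing the number of RDD partitions after reading the table:

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.enableHiveSupport().getOrCreate()
  df = spark.table("my_table")

  print(df.rdd.getNumPartitions())                  # large number, e.g. 15642

  # coalesce() reduces the partition count without a full shuffle.
  print(df.coalesce(18).rdd.getNumPartitions())     # 18

  # repartition() does a full shuffle to the requested count.
  print(df.repartition(18).rdd.getNumPartitions())  # 18

Note that this only changes how Spark splits the work; it does not change the Hive partitioning of the table on disk.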

Sagar

I found an easier workaround:

>>> t = spark.sql("show partitions my_table")
>>> t.count()
18
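
Alternatively, a sketch that derives the same count from the DataFrame itself (assuming the df from the question): it counts the distinct combinations of the partition columns, which matches the number of non-empty Hive partitions, though it scans the data instead of the metastore.

  >>> df.select('date', 't', 's', 'p').distinct().count()
  18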
Alan