
I select everything from a table and create a DataFrame (df) out of it using PySpark. The table is partitioned by:

  partitionBy('date', 't', 's', 'p')

Now I want to get the number of partitions using:

  df.rdd.getNumPartitions()

but it returns a much larger number (15642 partitions) than expected (18 partitions).

Output of the show partitions command in Hive:

 date=2019-10-02/t=u/s=u/p=s
 date=2019-10-03/t=u/s=u/p=s
 date=2019-10-04/t=u/s=u/p=s
 date=2019-10-05/t=u/s=u/p=s
 date=2019-10-06/t=u/s=u/p=s
 date=2019-10-07/t=u/s=u/p=s
 date=2019-10-08/t=u/s=u/p=s
 date=2019-10-09/t=u/s=u/p=s
 date=2019-10-10/t=u/s=u/p=s
 date=2019-10-11/t=u/s=u/p=s
 date=2019-10-12/t=u/s=u/p=s
 date=2019-10-13/t=u/s=u/p=s
 date=2019-10-14/t=u/s=u/p=s
 date=2019-10-15/t=u/s=u/p=s
 date=2019-10-16/t=u/s=u/p=s
 date=2019-10-17/t=u/s=u/p=s
 date=2019-10-18/t=u/s=u/p=s
 date=2019-10-19/t=u/s=u/p=s

Any idea why the number of partitions is so huge, and how can I get the expected number of partitions (18)?
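
For reference, here is a minimal sketch of the setup (the session setup and the table name my_table are placeholders, not the real names):

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.enableHiveSupport().getOrCreate()

  # Select everything from the Hive table into a DataFrame.
  df = spark.sql("SELECT * FROM my_table")

  # This reports Spark's internal RDD partition count (15642 here),
  # not the number of Hive partitions.
  print(df.rdd.getNumPartitions())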

Alan

2 Answers

spark.sql("show partitions hivetablename").count()

The number of partitions in rdd is different from the hive partitions. Spark generally partitions your rdd based on the number of executors in cluster so that each executor gets fair share of the task. You can control the rdd partitions by using sc.parallelize(, )) , df.repartition() or coalesce().
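
A minimal sketch (the table name my_table is a placeholder) of changing the number of RDD partitions after reading the table:

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.enableHiveSupport().getOrCreate()
  df = spark.table("my_table")

  print(df.rdd.getNumPartitions())                  # large number, e.g. 15642

  # coalesce() reduces the partition count without a full shuffle.
  print(df.coalesce(18).rdd.getNumPartitions())     # 18

  # repartition() does a full shuffle to the requested count.
  print(df.repartition(18).rdd.getNumPartitions())  # 18

Note that this only changes how Spark splits the work; it does not change the Hive partitioning of the table on disk.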

Sagar

I found an easier workaround:

>>> t = spark.sql("show partitions my_table")
>>> t.count()
18
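
Alternatively, a sketch that derives the same count from the DataFrame itself (assuming the df from the question): it counts the distinct combinations of the partition columns, which matches the number of non-empty Hive partitions, though it scans the data instead of the metastore.

  >>> df.select('date', 't', 's', 'p').distinct().count()
  18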
Alan