1

By looking at spark-sql plans, is there a way I can tell if a particular table (hive/iceberg) partition is being used or not?

For example, we have a table that has 3 partitions, let's say A=A_VAL, B=B_VAL, C=C_VAL. By looking at the plan is there a way I can tell if

  • the partitions are used fully (all 3 partitions used)
  • the partitions are used only partially (may be only 1 or 2 of the partitions are used, for example partition A is used but now B or C)

If spark-sql plans do not provide this information, is there any way I can get this information?

hba
  • 7,406
  • 10
  • 63
  • 105

1 Answers1

0

You can use below code to print the (logical and physical) plans.

import pyspark.sql

#create a df using your sql
df = sqlContext.sql("SELECT field1 AS f1, field2 as f2 from table1")

#use explain to see explain output. Without argument, you will get only physical plan
df.explain(True)
== Parsed Logical Plan ==
...
== Analyzed Logical Plan ==
...
== Optimized Logical Plan ==
...
== Physical Plan ==
...

EDIT : I ran explain for mytable and posted excerpt below. Which shows hive is choosing only few partitions(folders) and not going through all partitions. You should be able to see similar output.

here table is partitioned on part_col.
query used to generate this explain extended select * from mytab where part_col in (10,50).

sorry, I do not have spark installed so cant test it.

29
      Path -> Alias:
30
        hdfs://namenode:8020/user/hive/warehouse/tmp/part_col=10.0 [tmp]
31
        hdfs://namenode:8020/user/hive/warehouse/tmp/part_col=50.0 [tmp]
32
      Path -> Partition:
33
        hdfs://namenode:8020/user/hive/warehouse/tmp/part_col=10.0 
34
          Partition
35
            base file name: part_col=10.0
36
            input format: org.apache.hadoop.mapred.TextInputFormat
37
...

       hdfs://namenode:8020/user/hive/warehouse/tmp/part_col=50.0 
85
          Partition
86
            base file name: part_col=50.0
87
            input format: org.apache.hadoop.mapred.TextInputFormat
88
            output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
89
            partition values:
90
              college_marks 50.0
Koushik Roy
  • 6,868
  • 2
  • 12
  • 33
  • Sorry that's not what I am asking. I am asking given a plan how do I know if partitions are used. – hba Nov 30 '22 at 17:16