Let's start with the case of loading data from a file path, as opposed to the metastore. In this case, Spark will first do a recursive file listing to discover the nested partition folders and the files within them. The partition folder names are then turned into partition columns that can be used for partition pruning: when you filter on any of the partition columns, Spark selects only the partitions that satisfy that predicate. You can confirm this by calling the explain method on a query. Notice PartitionCount: 1 below:
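For reference, here is a minimal sketch of how such a layout could be produced and then loaded by path. The /tmp/data location and the id/state/city columns come from the plans below; the sample rows and the local SparkSession setup are assumptions for illustration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical sample rows; partitionBy creates the nested
// state=.../city=... folders that Spark later discovers by listing.
Seq(
  (1, "Texas", "Houston"),
  (2, "Texas", "Dallas"),
  (3, "California", "Los Angeles")
).toDF("id", "state", "city")
  .write
  .partitionBy("state", "city")
  .mode("overwrite")
  .parquet("/tmp/data")

// Loading by path triggers the recursive listing (the InMemoryFileIndex
// in the plans below); state and city are inferred as partition columns.
val input1 = spark.read.parquet("/tmp/data")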
scala> input1.where("city = 'Houston'").explain()
== Physical Plan ==
*(1) FileScan parquet [id#32,state#33,city#34] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/tmp/data], PartitionCount: 1, PartitionFilters: [isnotnull(city#34), (city#34 = Houston)], PushedFilters: [], ReadSchema: struct<id:int>
Compare that to a query plan without any filters, where PartitionCount is 5:
scala> input1.explain()
== Physical Plan ==
*(1) FileScan parquet [id#55,state#56,city#57] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/tmp/data], PartitionCount: 5, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int>
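Note that this pruning applies only to partition columns. A predicate on a regular data column such as id cannot eliminate partition folders; it would instead show up under PushedFilters (predicate pushdown into the Parquet reader) while PartitionFilters stays empty. A quick way to see the difference, using the same data:

// Filtering on the non-partition column id: the plan should show the
// predicate under PushedFilters rather than PartitionFilters, and still
// report PartitionCount: 5 because every folder has to be scanned.
input1.where("id = 1").explain()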
Now the second case is when you load a partitioned table from the metastore. Here the partitions are managed by Hive, which saves you the expensive recursive file listing. When you filter on a partition column, Spark will again select only the relevant partitions. Notice the explain plan below:
scala> input2.where("city = 'Houston'").explain()
== Physical Plan ==
*(1) FileScan parquet default.data[id#39,state#40,city#41] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[file:/tmp/data/state=Texas/city=Houston], PartitionCount: 1, PartitionFilters: [isnotnull(city#41), (city#41 = Houston)], PushedFilters: [], ReadSchema: struct<id:int>
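Note also that the Location here is a PrunedInMemoryFileIndex pointing straight at the single matching partition directory, so only that folder's files get listed. For completeness, if you need to register such a path-based layout as a metastore table yourself, one possible sketch, assuming a Hive-enabled session (SparkSession.builder().enableHiveSupport()) and reusing the table name data and the /tmp/data location from the plan above, is:

// Declare an external partitioned table over the existing folder layout.
spark.sql("""
  CREATE TABLE data (id INT, state STRING, city STRING)
  USING parquet
  PARTITIONED BY (state, city)
  LOCATION '/tmp/data'
""")

// Scan the directory tree once and record the partitions in the metastore,
// so later queries can prune without re-listing the filesystem.
spark.sql("MSCK REPAIR TABLE data")

val input2 = spark.table("data")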