
If I have a table created with multi-level partitions, i.e. partitioned by two columns (state, city), as follows:

    state=CA,city=Anaheim
    state=Texas,city=Houston
    state=Texas,city=Dallas
    state=Texas,city=Austin
    state=CA,city=SanDiego

and if I run a select query like this:

    select * from table_name where city = 'Houston'

i.e. where only the second partition column has been used, will it just scan the city=Houston partition within state=Texas? I am quite sure that this is how Hive operates, but I'm keen to confirm the behavior in Spark. Also, will the behavior be any different if it's executed in EMR's Spark?
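
For reference, this layout could be created with something like the following (a sketch; the exact dataset, schema, and path here are just for illustration):

    // Write a small dataset partitioned by (state, city) to produce
    // the directory layout above, so the pruning behavior can be tested.
    import spark.implicits._  // already imported automatically in spark-shell

    val df = Seq(
      (1, "CA", "Anaheim"),
      (2, "Texas", "Houston"),
      (3, "Texas", "Dallas"),
      (4, "Texas", "Austin"),
      (5, "CA", "SanDiego")
    ).toDF("id", "state", "city")

    df.write.partitionBy("state", "city").parquet("/tmp/data")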

rh979
  • It should read all top-level folders and then all sub-folders to figure out where `houston` is. But this should be very fast, as it's just folder traversal. – chendu Nov 12 '19 at 06:05
  • Thanks. Meaning that it won't traverse the data within the second-level partitions, just the metadata? – rh979 Nov 12 '19 at 06:09
  • It should not. I don't see a reason why it would. Run the query and check once. – chendu Nov 12 '19 at 10:58

2 Answers


If you are using Hive to store the table, then it will definitely be able to do partition pruning for both the outer and the inner partition columns. Hive keeps the partition metadata for a table separately, so when a query targets a particular partition it is able to apply this optimization.

You can test this behaviour with `explain select * from table_name where city = 'Houston';`
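
For example, from a spark-shell session against the Hive-backed table (a sketch; `table_name` is assumed to be the partitioned table described in the question):

    // Inspect the physical plan through Spark SQL...
    spark.sql("EXPLAIN SELECT * FROM table_name WHERE city = 'Houston'").show(truncate = false)

    // ...or through the DataFrame API; the partition filters and the
    // pruned partition count should appear in the printed plan.
    spark.table("table_name").where("city = 'Houston'").explain()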

However, if you are using Spark to write the partitions as a nested directory structure without a metastore, then I am not so sure. If the query needs to traverse the whole directory structure, that will be expensive when the number of directories is huge.

Avishek Bhattacharya

Let's start with the case of loading data directly from a file path, versus from the metastore. In this case, Spark will first do a recursive file listing to discover the nested partition folders and the files within them. The partition folders are then defined as fields used for partition pruning. So, in your case, when you filter on any of the partition columns, Spark will select only the partitions that satisfy that predicate. You can confirm this by using the explain method on a query. Notice below that PartitionCount: 1:
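
For these examples, `input1` is assumed to be a DataFrame read straight from the partitioned path, e.g. (a sketch; the `/tmp/data` location matches the plans below):

    // Read directly from the partitioned directory; Spark's file index discovers
    // the state=/city= sub-folders and exposes them as partition columns.
    val input1 = spark.read.parquet("/tmp/data")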

    scala> input1.where("city = 'Houston'").explain()
    == Physical Plan ==
    *(1) FileScan parquet [id#32,state#33,city#34] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/tmp/data], PartitionCount: 1, PartitionFilters: [isnotnull(city#34), (city#34 = Houston)], PushedFilters: [], ReadSchema: struct<id:int>

Compare that to a query plan without any filters, where PartitionCount: 5:

    scala> input1.explain()
    == Physical Plan ==
    *(1) FileScan parquet [id#55,state#56,city#57] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/tmp/data], PartitionCount: 5, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int>

The second case is loading a partitioned table from the metastore. Here the partitions are managed by Hive, which saves you the expensive recursive file listing. When you filter on a partition column, Spark again selects only the relevant partitions. Notice the PrunedInMemoryFileIndex in the explain plan below:
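
Here `input2` is assumed to be loaded through the metastore-backed table, e.g. (a sketch; the table name `data` matches the `default.data` shown in the plan):

    // Load the metastore-managed table; partition information comes from Hive,
    // so no recursive listing of the directory tree is needed.
    val input2 = spark.table("data")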

    scala> input2.where("city = 'Houston'").explain()
    == Physical Plan ==
    *(1) FileScan parquet default.data[id#39,state#40,city#41] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[file:/tmp/data/state=Texas/city=Houston], PartitionCount: 1, PartitionFilters: [isnotnull(city#41), (city#41 = Houston)], PushedFilters: [], ReadSchema: struct<id:int>

Silvio