5

I was going through Spark optimization methods and came across various ways to achieve optimization, but two names caught my eye:

  1. Partition Pruning
  2. Predicate Pushdown

They say:

Partition Pruning:

Partition pruning is a performance optimization that limits the number of files and partitions that Spark reads when querying. After partitioning the data, queries that match certain partition filter criteria improve performance by allowing Spark to only read a subset of the directories and files.
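
To make sure I'm picturing this correctly, here is a minimal sketch of what I understand partition pruning to look like; the paths and the country column are just placeholders I made up:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partition-pruning-sketch").master("local[*]").getOrCreate()

// Write the data partitioned by "country": one directory per distinct value,
// e.g. /data/events/country=US/, /data/events/country=DE/, ...
spark.read.parquet("/data/events_raw")
  .write.partitionBy("country").parquet("/data/events")

// A filter on the partition column means Spark only lists and reads the
// matching directories (partition pruning).
val us = spark.read.parquet("/data/events").filter("country = 'US'")
us.explain(true)  // the scan node shows something like PartitionFilters: [(country = US)]
```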

Predicate Pushdown:

Spark will attempt to move filtering of data as close to the source as possible to avoid loading unnecessary data into memory. Parquet and ORC files maintain various stats about each column in different chunks of data (such as min and max values). Programs reading these files can use these indexes to determine if certain chunks, and even entire files, need to be read at all. This allows programs to potentially skip over huge portions of the data during processing.
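
And here is my rough sketch of predicate pushdown, again with made-up paths and columns; this time the filter is on an ordinary (non-partition) column:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("predicate-pushdown-sketch").master("local[*]").getOrCreate()

// "amount" is an ordinary (non-partition) column stored in the Parquet files.
val big = spark.read.parquet("/data/events").filter("amount > 1000")

// The condition appears under PushedFilters in the scan node; the Parquet
// reader can use per-row-group min/max statistics for "amount" to skip
// chunks (and whole files) that cannot contain matching rows.
big.explain(true)
```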

From these descriptions, the two appear to do the same thing: read only the data that satisfies the predicates given in the query. Are Partition Pruning and Predicate Pushdown different concepts, or am I looking at them the wrong way?


2 Answers

3

The difference is about who applies the optimization, where the optimization is applied and which data sources it can be applied to.

  • Partition pruning is applied by Spark itself, before it delegates to the data source that handles the file format. It is only applicable to file-based formats, as other data sources don't have the concept of partition discovery (yet).

  • Predicate push down delegates the filtering of rows to the data source responsible for handling a particular format (Spark's term for a type of data source). It is available for both file-based and non-file-based sources, e.g., RDBMSes and NoSQL databases. The sketch below shows how the two surface differently in a query plan.
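
To make the difference concrete, here is a minimal sketch (table layout and column names are made up) of a single query whose physical plan shows both: the partition filter is handled by Spark's own file listing, while the remaining predicate is handed to the Parquet reader:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pruning-vs-pushdown").master("local[*]").getOrCreate()

// Assume /data/events was written with .partitionBy("country") and also
// contains an ordinary data column "amount" (both are made-up names).
val q = spark.read.parquet("/data/events")
  .filter("country = 'US' AND amount > 1000")

// In the file scan node of the physical plan you should see roughly:
//   PartitionFilters: [isnotnull(country), (country = US)]          <- pruning, done by Spark
//   PushedFilters:    [IsNotNull(amount), GreaterThan(amount,1000)] <- handed to the Parquet reader
q.explain(true)
```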

Sim
  • Can you please provide references for these statements? I agree that predicate pushdown is applicable to files as well as DBs, but I am not sure that partition pruning applies to files only. I saw some notes that it applies to data sources, but nowhere is it mentioned as specific to files. – Abhijit Mar 08 '21 at 11:34
  • Please have a look at https://spark.apache.org/docs/3.0.0-preview/api/java/org/apache/spark/sql/dynamicpruning/PartitionPruning.html – Abhijit Mar 08 '21 at 11:34
0

Predicate Pushdown is a technique where filters, or predicates, are pushed down to the storage layer of a database management system. This way, only the relevant data is retrieved from the storage layer, reducing the amount of data that needs to be processed and thus improving the performance of the query. By pushing down the predicates, the database system can take advantage of any indexing or other optimization that is present at the storage layer.
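
As a rough sketch of what this looks like from Spark against a non-file source (connection details are made up, and the JDBC driver is assumed to be on the classpath):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-pushdown-sketch").master("local[*]").getOrCreate()

val orders = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/shop")
  .option("dbtable", "orders")
  .option("user", "reader")
  .option("password", "secret")
  .load()
  .filter("status = 'SHIPPED'")

// The filter is pushed into the SQL that Spark sends to the database
// (a WHERE clause), so only matching rows leave the storage layer.
orders.explain(true)  // scan shows something like PushedFilters: [IsNotNull(status), EqualTo(status,SHIPPED)]
```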

Column Pruning is a technique where unnecessary columns are removed from the query processing pipeline. This can improve the performance of a query by reducing the amount of data that needs to be processed, stored in memory, and transferred over the network. The database system determines which columns are necessary based on the query, and removes the unused columns before the query is executed.
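
A minimal sketch of column pruning in Spark over Parquet, with made-up path and column names:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("column-pruning-sketch").master("local[*]").getOrCreate()

// The Parquet files may contain many columns; selecting just two of them
// narrows the scan's ReadSchema, so the other columns are never decoded
// or shipped over the network.
val slim = spark.read.parquet("/data/events").select("user_id", "amount")

slim.explain(true)  // ReadSchema lists only user_id and amount
```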

Both Predicate Pushdown and Column Pruning are important optimization techniques that are used in modern database systems to improve query performance and make efficient use of available resources.