I was going through Spark optimization techniques and came across various ways to achieve better performance. But two names caught my eye.
- Partition Pruning
- Predicate Pushdown
They say:
Partition Pruning:
Partition pruning is a performance optimization that limits the number of files and partitions that Spark reads when querying. After partitioning the data, queries that match certain partition filter criteria improve performance by allowing Spark to only read a subset of the directories and files.
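To illustrate what that definition means, here is a minimal sketch (paths and the `date` column name are my own assumptions, not from any real dataset):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("pruning-demo").getOrCreate()
import spark.implicits._

// Write the data partitioned by "date", so each date value gets its
// own directory, e.g.:
//   /data/events/date=2021-01-01/part-00000.parquet
//   /data/events/date=2021-01-02/part-00000.parquet
spark.read.parquet("/data/events_raw")
  .write.partitionBy("date").parquet("/data/events")

// A filter on the partition column lets Spark list and read ONLY the
// matching directories -- files under the other dates are never opened.
val oneDay = spark.read.parquet("/data/events")
  .filter($"date" === "2021-01-02")

// The physical plan's FileScan node should show the pruned
// PartitionFilters when you inspect it.
oneDay.explain()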
Predicate Pushdown:
Spark will attempt to move filtering of data as close to the source as possible to avoid loading unnecessary data into memory. Parquet and ORC files maintain various statistics about each column in different chunks of data (such as min and max values). Programs reading these files can use these statistics to determine whether certain chunks, or even entire files, need to be read at all. This allows programs to potentially skip over huge portions of the data during processing.
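Again as a sketch (the path and the `amount` column are hypothetical), note that here the filter is on a regular, non-partition column, so no directories can be skipped; instead the predicate is handed to the Parquet reader:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("pushdown-demo").getOrCreate()
import spark.implicits._

// Filter on an ordinary data column. Spark pushes this predicate down
// to the Parquet reader, which compares it against each row group's
// min/max statistics for "amount" and skips row groups (or whole
// files) that cannot possibly contain matching rows.
val bigSpenders = spark.read.parquet("/data/events")
  .filter($"amount" > 1000)

// The physical plan's FileScan node should list the predicate under
// PushedFilters, e.g. [IsNotNull(amount), GreaterThan(amount,1000)].
bigSpenders.explain()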
Reading the above, the two concepts appear to do the same thing: both skip reading data that cannot satisfy the predicates in the query. Are Partition Pruning and Predicate Pushdown actually different concepts, or am I looking at them the wrong way?