I have a question about filtering in Spark when the filter does not include the partition columns.
Imagine that I have the following data, partitioned by date:
path/date=20200721/part-0000.parquet
path/date=20200721/part-0001.parquet
path/date=20200721/part-0002.parquet
path/date=20200722/part-0000.parquet
path/date=20200722/part-0001.parquet
path/date=20200722/part-0002.parquet
...
The data has a column named "action"; around 30% of the rows have the value 0 and the rest have the value 1.
If I run the following:
spark.read.parquet("s3a://path").filter("action = 0")
Does Spark have to list and scan all the files located under "path"? Or is some filter pushdown applied? Or does Spark only apply a pushdown filter when a partition column is present in the filter?
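To make the distinction concrete, here is a toy sketch in plain Python (not Spark code, and not how Spark is implemented) of my current mental model, which may well be wrong. It assumes that a filter on the partition column can drop whole directories based on the `date=...` path segment, while a filter on a non-partition column such as "action" cannot rule out any file from its path alone. The file list mirrors the layout above; the helper names `partition_value` and `files_to_scan` are hypothetical.

```python
# Toy model only: plain Python, not Spark internals.
FILES = [
    "path/date=20200721/part-0000.parquet",
    "path/date=20200721/part-0001.parquet",
    "path/date=20200721/part-0002.parquet",
    "path/date=20200722/part-0000.parquet",
    "path/date=20200722/part-0001.parquet",
    "path/date=20200722/part-0002.parquet",
]

# Columns encoded in the directory names rather than inside the files.
PARTITION_COLUMNS = {"date"}


def partition_value(path, column):
    """Return the value from a `column=value` path segment, or None."""
    for segment in path.split("/"):
        key, sep, value = segment.partition("=")
        if sep and key == column:
            return value
    return None


def files_to_scan(files, column, value):
    """Files that remain candidates for the predicate `column = value`."""
    if column in PARTITION_COLUMNS:
        # Partition column: the path alone decides whether a file can match.
        return [f for f in files if partition_value(f, column) == value]
    # Non-partition column: the path says nothing, so every file stays.
    return list(files)


pruned = files_to_scan(FILES, "date", "20200721")
unpruned = files_to_scan(FILES, "action", "0")
print(len(pruned), len(unpruned))  # prints "3 6"
```

If this model is right, my filter on "action" alone would still touch every file, and any saving would have to come from filtering inside the Parquet reader rather than from skipping directories. That is the part I would like confirmed.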
Thanks.