
I have a question regarding filtering in Spark when you do not include the partition columns in the filter.

Imagine that I have the following data partitioned by date:

path/date=20200721/part-0000.parquet
                   part-0001.parquet
                   part-0002.parquet
path/date=20200722/part-0000.parquet
                   part-0001.parquet
                   part-0002.parquet
...

The data has one column named "action": around 30% of the rows have a value of 0, and the rest have a value of 1.

If I run the following:

spark.read.parquet("s3a://path").filter("action = 0")

Does Spark have to list and scan all the files located in "path" at the source? Or is there some pushdown filtering in place? Or does Spark only apply a pushdown filter when a partition column is present in the filter?

Thanks.

AJDF

1 Answer


1. Does Spark have to list and scan all the files located in "path" from the source?

Yes. Since you are not filtering on a partition column, Spark has to list and scan all the files.
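For contrast, here is a minimal sketch of the two cases, reusing the hypothetical s3a path from the question. With a filter on the partition column, Spark can prune whole directories at listing time:

// Filter on the partition column "date": partition pruning applies,
// so only files under date=20200721 are listed and read.
val pruned = spark.read.parquet("s3a://path").filter("date = '20200721'")

// Filter on the data column "action": all partition directories are
// listed, and the predicate is pushed down into each Parquet file scan.
val scanned = spark.read.parquet("s3a://path").filter("action = 0")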

2. Is there some pushdown filtering in place?

A pushdown filter will be applied to each file while it is read.

3. Does Spark only apply a pushdown filter when a partition column is present in the filter?

No. A partition filter is applied when a partition column is present in the filter; otherwise, predicate pushdown is applied while scanning the files.

See also: partition filter vs pushdown filter.

  • You can check all these details in the query plan with .explain(true), as sketched below.
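For illustration (the exact plan text varies by Spark version, so the output comments below are indicative only), the FileScan line of the physical plan reports both filter types:

spark.read.parquet("s3a://path").filter("action = 0").explain(true)
// Physical plan (illustrative): FileScan parquet ...
//   PartitionFilters: [], PushedFilters: [IsNotNull(action), EqualTo(action,0)]

spark.read.parquet("s3a://path").filter("date = 20200721").explain(true)
// Physical plan (illustrative): FileScan parquet ...
//   PartitionFilters: [isnotnull(date), (date = 20200721)], PushedFilters: []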

To check whether filter pushdown is enabled:

spark.sql("set spark.sql.parquet.filterPushdown").show(10,false)
//+--------------------------------+-----+
//|key                             |value|
//+--------------------------------+-----+
//|spark.sql.parquet.filterPushdown|true |
//+--------------------------------+-----+
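The flag can also be toggled for the current session, e.g. to compare scan behaviour with and without pushdown (spark.conf.set is available in Spark 2.x+):

// Disable Parquet predicate pushdown for this session...
spark.conf.set("spark.sql.parquet.filterPushdown", "false")
// ...and turn it back on.
spark.conf.set("spark.sql.parquet.filterPushdown", "true")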
notNull