
I am running the same query on the same dataset with the same Spark version (2.4.0) in two different environments: the explain plan includes DataFilters in the Databricks environment, but not on my local machine. I'd like to understand what DataFilters means and why it appears in one plan but not the other.

I have a sample Parquet dataset that looks like this:

parquetDF.show()
+----------+---------+---------+
|first_name|last_name|  country|
+----------+---------+---------+
|   Ernesto|  Guevara|Argentina|
|  Vladimir|    Putin|   Russia|
|     Maria|Sharapova|   Russia|
|     Bruce|      Lee|    China|
|      Jack|       Ma|    China|
+----------+---------+---------+
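
For reference, the dataset can be reproduced with a snippet roughly like this (the output path is just a placeholder, not the actual path I use):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("DataFiltersExample").getOrCreate()
import spark.implicits._

// Build the sample rows and write them out as Parquet (placeholder path).
val people = Seq(
  ("Ernesto", "Guevara", "Argentina"),
  ("Vladimir", "Putin", "Russia"),
  ("Maria", "Sharapova", "Russia"),
  ("Bruce", "Lee", "China"),
  ("Jack", "Ma", "China")
).toDF("first_name", "last_name", "country")

people.write.mode("overwrite").parquet("/tmp/parquet_blog_data")

// Read the Parquet files back and inspect the physical plan for the filter.
val parquetDF = spark.read.parquet("/tmp/parquet_blog_data")
parquetDF.where($"first_name" === "Maria").explain()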

When I run parquetDF.where($"first_name" === "Maria").explain() on my local machine from the Spark console, I get this physical plan:

== Physical Plan ==
*(1) Project [first_name#25, last_name#26, country#27]
+- *(1) Filter (isnotnull(first_name#25) && (first_name#25 = Maria))
   +- *(1) FileScan parquet [first_name#25,last_name#26,country#27] 
           Batched: true, 
           Format: Parquet,
           Location: InMemoryFileIndex[file:/Users/powers/Documents/tmp/parquet_blog_data], 
           PartitionFilters: [], 
           PushedFilters: [IsNotNull(first_name), 
           EqualTo(first_name,Maria)],
           ReadSchema: struct<first_name:string,last_name:string,country:string>

When I run the same query in Databricks (Runtime 5.2) on the same file in S3, I get this physical plan:

== Physical Plan ==
*(1) Project [first_name#259, last_name#260, country#261]
+- *(1) Filter (isnotnull(first_name#259) && (first_name#259 = Maria))
   +- *(1) FileScan parquet [first_name#259,last_name#260,country#261] 
           Batched: true, 
           DataFilters: [isnotnull(first_name#259), (first_name#259 = Maria)], 
           Format: Parquet, 
           Location: InMemoryFileIndex[some_bucket/parquet], 
           PartitionFilters: [], 
           PushedFilters: [IsNotNull(first_name), 
           EqualTo(first_name,Maria)], 
           ReadSchema: struct<first_name:string,last_name:string,country:string>

What are the DataFilters? Are these filters that are applied at the S3 level? Perhaps these filters are applied before the data is sent from S3 to the ec2 cluster?
