
Say I have this pseudo-code in Spark SQL, where t1 is a temp view built on partitioned Parquet files in HDFS and t2 is a small lookup table used to filter that temp view:

select t1.*
from t1
where exists (select *
              from t2
              where t1.id = t2.id
                and lower(t1.col) like lower(t2.pattern))  -- to mimic ilike functionality

Will the optimizer treat lower(t1.col) like lower(t2.pattern) as a case-insensitive match? Or will it run transformations on both columns every time before performing the match?

I don't have access to the DAG to see what exactly happens behind the scenes, so I'm asking here to find out whether this is a known/documented optimization.
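(For what it's worth, the physical plan can also be printed without the Spark UI by calling explain() on the query itself; a minimal sketch, assuming t1 and t2 are registered as temp views:)

// Prints the physical plan without executing the query;
// t1 and t2 are assumed to be registered temp views, as in the question.
spark.sql("""
  select t1.*
  from t1
  where exists (select *
                from t2
                where t1.id = t2.id
                  and lower(t1.col) like lower(t2.pattern))
""").explain()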

Radagast

1 Answer


I tried to reproduce that case using Scala and then called explain() to get the physical plan. (I'm pretty sure SQL and Scala produce the same physical plan, because behind the scenes it's the same optimizer, Catalyst.)

import spark.implicits._
import org.apache.spark.sql.functions.{col, lower}

// Two string columns whose values differ only by case in some rows
val df1 = spark.sparkContext.parallelize(Seq(("Car", "car"), ("bike", "Rocket"), ("Bus", "BUS"), ("Auto", "Machine"))).toDF("c1", "c2")

// Case-insensitive equality by applying lower() to both sides
df1.filter(lower(col("c1")).equalTo(lower(col("c2")))).explain()

== Physical Plan ==
*(1) Project [_1#3 AS c1#8, _2#4 AS c2#9]
+- *(1) Filter ((isnotnull(_1#3) AND isnotnull(_2#4)) AND (lower(_1#3) = lower(_2#4)))
   +- *(1) SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1, true, false, true) AS _1#3, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2, true, false, true) AS _2#4]
      +- Scan[obj#2]

As you can see in the physical plan, it calls lower each time to compare the two values: lower(_1#3) = lower(_2#4).

By the way, I tried the same thing joining two DataFrames and then filtering on lower, and I got the same result (a sketch of that variant is below).
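A minimal sketch of that join variant, assuming an illustrative lookup DataFrame (df2 and its column names k/pattern are made up, not from the original code):

import org.apache.spark.sql.functions.expr

// Illustrative lookup DataFrame; df2, k, and pattern are hypothetical names
val df2 = Seq(("car", "c%"), ("bus", "b_s")).toDF("k", "pattern")

// Column-to-column LIKE is not exposed on the Column API,
// so expr() is used to build the same predicate as in the SQL version
df1.join(df2, expr("lower(c1) = lower(k)"))
   .filter(expr("lower(c1) like lower(pattern)"))
   .explain()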

I hope this answers your question.

Abdennacer Lachiheb
  • Thanks for reproducing the scenario. I was seeing the same thing in my execution plan. I don't know how much of a performance penalty I'm paying for `lower()`, but it looks like they just added support for `ilike` in Spark 3.3.0: https://spark.apache.org/releases/spark-release-3-3-0.html. I might as well push for an upgrade – Radagast Dec 14 '22 at 01:48
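For reference, a minimal sketch of the query on Spark 3.3.0+ using the ILIKE operator mentioned in the comment (t1 and t2 as in the question; an illustration, not a tested rewrite):

// ILIKE (case-insensitive LIKE) is available from Spark 3.3.0 onward,
// so the explicit lower() calls can be dropped from the predicate
spark.sql("""
  select t1.*
  from t1
  where exists (select *
                from t2
                where t1.id = t2.id
                  and t1.col ilike t2.pattern)
""").explain()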