6

Here is the sample code which i am running.

Creating a test parquet Dataset with mod column as partition.

scala> val test = spark.range(0 , 100000000).withColumn("mod", $"id".mod(40))
test: org.apache.spark.sql.DataFrame = [id: bigint, mod: bigint]

scala> test.write.partitionBy("mod").mode("overwrite").parquet("test_pushdown_filter")

After that, i am reading this data as dataframe and applying filter on partition column i.e. mod.

scala> val df = spark.read.parquet("test_pushdown_filter").filter("mod = 5")
df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: bigint, mod: int]

scala> df.queryExecution.executedPlan
res1: org.apache.spark.sql.execution.SparkPlan =
*FileScan parquet [id#16L,mod#17] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/C:/Users/kprajapa/WorkSpace/places/test_pushdown_filter], PartitionCount: 1, PartitionFilters: [
isnotnull(mod#17), (mod#17 = 5)], PushedFilters: [], ReadSchema: struct<id:bigint>

You can see in execution plan, it is only reading 1 partition.

But if you apply same filter with dataset. its reading all the partition and then applying filter.

scala> case class Test(id: Long, mod: Long)
defined class Test

scala> val ds = spark.read.parquet("test_pushdown_filter").as[Test].filter(_.mod==5)
ds: org.apache.spark.sql.Dataset[Test] = [id: bigint, mod: int]

scala> ds.queryExecution.executedPlan
res2: org.apache.spark.sql.execution.SparkPlan =
*Filter <function1>.apply
+- *FileScan parquet [id#22L,mod#23] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/C:/Users/kprajapa/WorkSpace/places/test_pushdown_filter], PartitionCount: 40, PartitionFilter
s: [], PushedFilters: [], ReadSchema: struct<id:bigint>

Is this how dataset API works? or am i missing something?

Kaushal
  • 3,237
  • 3
  • 29
  • 48
  • 1
    https://stackoverflow.com/questions/50129411/why-is-predicate-pushdown-not-used-in-typed-dataset-api-vs-untyped-dataframe-ap , Yes this explain why pushdown is not working, but is there any way, I can force to spark to apply pushdown, like kind of custom **sparkplan** or some other. – Kaushal Jun 23 '18 at 17:16

0 Answers0