
I have a use case where we need to stream an open-source Delta table into multiple queries, each filtered on one of the partition columns. For example, given a Delta table partitioned on the year column:

Streaming query 1:

spark.readStream.format("delta").load("/tmp/delta-table/")
  .where("year = 2013")

Streaming query 2:

spark.readStream.format("delta").load("/tmp/delta-table/")
  .where("year = 2014")

The physical plan shows the filter applied after the streaming relation:

> == Physical Plan ==
> Filter (isnotnull(year#431) AND (year#431 = 2013))
> +- StreamingRelation delta, []

My question is: does predicate pushdown work with streaming queries in Delta? Can we stream only a specific partition from the Delta table?

Amit Joshi

1 Answer


If the table is partitioned on the filter column, only the required partitions will be scanned.

Let's create both a partitioned and a non-partitioned Delta table and run Structured Streaming against each.

Partitioned delta table streaming:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._

// sample dataframe
val df = Seq((1,2020),(2,2021),(3,2020),(4,2020),
  (5,2020),(6,2020),(7,2019),(8,2019),(9,2018),(10,2020)).toDF("id","year")

// partitionBy year column and save as a delta table
df.write.format("delta").partitionBy("year").save("delta-stream")

// stream the delta table, filtering on the partition column
spark.readStream.format("delta").load("delta-stream")
  .where('year === 2020)
  .writeStream.format("console").start().awaitTermination()

Physical plan of the above streaming query (notice the PartitionFilters):

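Roughly, the micro-batch scan node looks like the following (a sketch, not exact output; node layout and expression IDs vary by Spark version):

> == Physical Plan ==
> *(1) ColumnarToRow
> +- FileScan parquet [id#x,year#y] Batched: true, DataFilters: [],
>    PartitionFilters: [isnotnull(year#y), (year#y = 2020)], PushedFilters: [],
>    ReadSchema: struct<id:int>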

Non-partitioned delta table streaming:

df.write.format("delta").save("delta-stream")

spark.readStream.format("delta").load("delta-stream")
    .where('year===2020)
    .writeStream.format("console").start().awaitTermination()

Physical plan of the above streaming query (notice the PushedFilters):

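Again as a sketch (exact layout varies by Spark version), the filter stays in the plan and is also pushed into the Parquet scan:

> == Physical Plan ==
> *(1) Filter (isnotnull(year#y) AND (year#y = 2020))
> +- *(1) ColumnarToRow
>    +- FileScan parquet [id#x,year#y] Batched: true,
>       DataFilters: [isnotnull(year#y), (year#y = 2020)], PartitionFilters: [],
>       PushedFilters: [IsNotNull(year), EqualTo(year,2020)], ReadSchema: struct<id:int,year:int>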

Mohana B C
  • Are you using the open-source or the Databricks version? In OSS the pushdown filter is not present. I am updating the question to mention the open-source version. – Amit Joshi Feb 24 '21 at 09:36
  • @AmitJoshi- Open source – Mohana B C Feb 24 '21 at 09:37
  • Can you please let me know the version of delta-core used? – Amit Joshi Feb 24 '21 at 09:39
  • spark 3.0.1 and delta-core 0.7.0 – Mohana B C Feb 24 '21 at 09:40
  • I am not sure why, but I cannot see the pushed filter. Can you please paste the code you used to print the execution plan? Maybe it is getting truncated for me. – Amit Joshi Feb 24 '21 at 11:01
  • You can easily check that in the Spark UI (port 4040) or using df.explain(true) – Mohana B C Feb 24 '21 at 11:07
  • I am not sure how you got the pushdown. I executed your code and this is what I got: spark.readStream.format("delta").load("/tmp/delta-test").where('year===2020).explain(true) == Physical Plan == *(1) Filter (isnotnull(year#409) AND (year#409 = 2020)) +- StreamingRelation delta, [id#408, year#409] – Amit Joshi Feb 24 '21 at 11:15
  • Could you please check the Spark UI once? There you get the complete job information (see the sketch after this thread). – Mohana B C Feb 24 '21 at 12:03
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/229158/discussion-between-amit-joshi-and-mohana-b-c). – Amit Joshi Feb 24 '21 at 12:48
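Note that calling explain(true) on a streaming DataFrame before the query starts only shows the unresolved StreamingRelation, which is why the pushdown is not visible there. A minimal sketch of printing the plan of an executed micro-batch instead, assuming Spark 3.x and the delta-stream path from the answer:

val query = spark.readStream.format("delta").load("delta-stream")
  .where('year === 2020)
  .writeStream.format("console").start()

query.processAllAvailable() // block until at least one micro-batch has completed
query.explain()             // prints the physical plan of the most recent micro-batch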