There are two tables: one big (T0) and one small (T1). I run the code below and expect it to use dynamic partition pruning (DPP), but it does not:
from pyspark.sql import functions as F

df = spark.table('T0').select('A', 'B', 'C')
df1 = spark.table('T1').select('A')
df.join(F.broadcast(df1), ['A']).explain()
Then I apply a hack and DPP starts working:
from pyspark.sql.types import ArrayType, StringType

# Collect the join keys to the driver and rebuild the small side as a
# literal DataFrame (explode turns the single array row back into rows).
values = [row.A for row in df1.select('A').collect()]
values_type = ArrayType(StringType())
df2 = spark.createDataFrame([values], values_type).select(F.explode('value').alias('A'))
df.join(F.broadcast(df2), ['A']).explain()
In the first case the plan shows:
BatchScan[...] T0 [filters=] RuntimeFilters: []
and in the second:
BatchScan[...] T0 [filters=] RuntimeFilters: [dynamicpruningexpression(A#5172 IN dynamicpruning#10343)]
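For anyone checking: as far as I know, DPP is gated by these Spark 3 optimizer settings (shown with what I understand to be their default values; this is a sketch to verify in your own session, not a dump of my config):

```
spark.sql.optimizer.dynamicPartitionPruning.enabled            true
spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly true
```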
What is the difference between the two cases? My guess was that nullable = true on T1's column was the cause, but I tried both options (nullable = true and nullable = false) and DPP does not kick in with either.
I don't know whether it matters, but I use Iceberg as the storage format.