
There are two tables: one big (T0) and one small (T1). I run the code below and expect it to use dynamic partition pruning (DPP), but it does not:

from pyspark.sql import functions as F

df = spark.table('T0').select('A', 'B', 'C')
df1 = spark.table('T1').select('A')
df.join(F.broadcast(df1), ['A']).explain()

Then I apply a hack and DPP starts working:

from pyspark.sql.types import ArrayType, StringType

values = [row.A for row in df1.select('A').collect()]
values_type = ArrayType(StringType())
df2 = spark.createDataFrame([values], values_type).select(F.explode('value').alias('A'))
df.join(F.broadcast(df2), ['A']).explain()

In the first case I see: BatchScan[...] T0 [filters=] RuntimeFilters: []

And in the second: BatchScan[...] T0 [filters=] RuntimeFilters: [dynamicpruningexpression(A#5172 IN dynamicpruning#10343)]

What is the difference between the two cases? My guess was that it had to do with nullable = true on T1, but I tried both options (nullable = true|false) and neither makes DPP kick in.

I don't know if it is important, but I use Iceberg as the data storage.

Koedlt
Alex Loo

0 Answers