There are two tables: one big (T0) and one small (T1). I run the code below and expect it to use dynamic partition pruning (DPP), but it does not:
from pyspark.sql import functions as F

df = spark.table('T0').select('A', 'B', 'C')
df1 = spark.table('T1').select('A')
df.join(F.broadcast(df1), ['A']).explain()
Then I apply a hack and DPP starts working:
from pyspark.sql.types import ArrayType, StringType

# Collect the join keys to the driver and rebuild the small side as a
# literal DataFrame (explode turns the single array row back into rows).
values = [row.A for row in df1.select('A').collect()]
values_type = ArrayType(StringType())
df2 = spark.createDataFrame([values], values_type).select(F.explode('value').alias('A'))
df.join(F.broadcast(df2), ['A']).explain()
In the first case the plan shows:
BatchScan[...] T0 [filters=] RuntimeFilters: []
and in the second:
BatchScan[...] T0 [filters=] RuntimeFilters: [dynamicpruningexpression(A#5172 IN dynamicpruning#10343)]
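For anyone checking: as far as I know, DPP is gated by these Spark 3 optimizer settings (shown with what I understand to be their default values; this is a sketch to verify in your own session, not a dump of my config):

```
spark.sql.optimizer.dynamicPartitionPruning.enabled            true
spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly true
```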
What is the difference between the two cases? My guess was that nullable = true on T1's column was the cause, but I tried both options (nullable = true and nullable = false) and DPP does not kick in with either.
I don't know whether it matters, but I use Iceberg as the storage format.