I use DataFrame mapPartitions in a library that is loosely an implementation of the Uber Case Study. The output DataFrame has some new (large) columns, and the input DataFrame is partitioned and internally sorted before the mapPartitions call. Most users then project on the additional column(s) and aggregate on the column the data is already partitioned by. This causes an expensive, redundant shuffle, since mapPartitions uses planWithBarrier, which seems to hide the existing partitioning from the planner. Is there a not-too-hacky solution for this in the Catalyst API?
Code example:
val resultDF = keysDF
.select("key1") //non unique
.join(mappingTable.select("key1", "key2"), "key1") //key1->key2 many to one
.repartition($"key2")
.sortWithinPartitions($"key1", $"key2")
.mapPartitions(appendThreeColumns)(RowEncoder(outputSchema))
.select("key1", "key2", "value1", "value2", "value3")
As you can see, resultDF is effectively partitioned by key1 (mind the many-to-one key1 -> key2 relationship: all rows sharing a key1 also share a key2, so they land in the same partition) and internally sorted. However, resultDF.groupBy("key1").agg(count("value1")), for example, will still cause an Exchange.
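The redundant shuffle shows up directly in the physical plan. A minimal sketch to make it visible (count comes from org.apache.spark.sql.functions):

import org.apache.spark.sql.functions.count

// explain() prints the physical plan; the Exchange hashpartitioning(key1, ...)
// node is the shuffle I would like to avoid, since the data is already
// co-located by key1.
resultDF
  .groupBy("key1")
  .agg(count("value1"))
  .explain()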
Any advice is welcome.