
I use DataFrame mapPartitions in a library which is loosely an implementation of the Uber Case Study. The output DataFrame has some new (large) columns, and the input DataFrame is partitioned and internally sorted before doing mapPartitions. Most users would project on the additional column(s) and then aggregate on the already-partitioned column. This causes an expensive redundant shuffle, since mapPartitions uses planWithBarrier. I wonder if there is a not-too-hacky solution for this in the Catalyst API?

Code example:

import org.apache.spark.sql.catalyst.encoders.RowEncoder
import spark.implicits._ // for the $"..." column syntax

val resultDF = keysDF
    .select("key1") // non-unique
    .join(mappingTable.select("key1", "key2"), "key1") // key1 -> key2 is many-to-one
    .repartition($"key2")
    .sortWithinPartitions($"key1", $"key2")
    .mapPartitions(appendThreeColumns)(RowEncoder(outputSchema))
    .select("key1", "key2", "value1", "value2", "value3")

As you can see, resultDF is partitioned by key1 (mind the many-to-one relationship) and internally sorted.

However, resultDF.groupBy("key1").agg(count("value1")), for example, will cause an Exchange.
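The redundant shuffle shows up directly in the physical plan; a minimal check, assuming the snippet above:

    import org.apache.spark.sql.functions.count

    // The plan contains an Exchange hashpartitioning(key1, ...) step even
    // though the data was already co-partitioned by key1 before mapPartitions.
    resultDF.groupBy("key1").agg(count("value1")).explain()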

Any advice is welcome.

shay__
  • I think resultDF is partitioned by key2 not key1 because you do `.repartition($"key2")`. – astro_asz Aug 01 '19 at 09:21
  • @astro_asz No, "mind the many-to-one relationship" – shay__ Aug 01 '19 at 09:43
  • What does `.repartition($"key2")` do? – astro_asz Aug 01 '19 at 10:09
  • I'm not sure what you mean – shay__ Aug 01 '19 at 10:37
  • Sorry, just trying to help. Try `resultDF.groupBy("key1").agg(count("value1")).explain()` and look for `Exchange hashpartitioning` lines, which will tell you what your df is partitioned on. keysDF is partitioned on key1, and after the join it is still only partitioned on key1; then it is shuffled on key2 in repartition(key2) (after which it is no longer partitioned by key1), and then shuffled again on key1 for the groupBy. – astro_asz Aug 01 '19 at 11:19
  • @astro_asz I appreciate your help but you seem to miss the question - Spark does not even know about `repartition(key2)` because of the analysis barrier. Also, there is a many-to-one relationship between `key1` and `key2` so the output DataFrame is partitioned by both. – shay__ Aug 01 '19 at 11:28

1 Answer


I think you are creating a few more columns with the mapPartitions logic and then applying aggregate operations; because of this you are getting a lot of shuffles across multiple executors. Spark has a bucketing concept; please follow this link. Apply bucketing before mapPartitions, then try the aggregations after mapPartitions. I think it will reduce network I/O.
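A minimal sketch of the bucketing idea, assuming a SparkSession with a metastore available; the source DataFrame inputDF, the table name mapped_keys, and the bucket count of 200 are all made up for illustration:

    // Persist the data bucketed (and sorted) by the grouping key; Spark
    // records the bucketing metadata in the catalog, so a later
    // groupBy("key1") on the bucketed table can avoid a full shuffle.
    inputDF.write
        .bucketBy(200, "key1")      // bucket count is an arbitrary example
        .sortBy("key1", "key2")
        .saveAsTable("mapped_keys") // hypothetical table name

    // Reading the table back restores the bucketing information.
    val bucketed = spark.table("mapped_keys")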

Ravi
  • Thanks Ravi :) I guess the question was badly described because bucketing has very little to do with it. I will add more details – shay__ Aug 01 '19 at 07:18