
I use DataFrame mapPartitions in a library which is loosely an implementation of the Uber Case Study. The output DataFrame has some new (large) columns, and the input DataFrame is partitioned and internally sorted before doing mapPartitions. Most users would project on the additional column(s) and then aggregate on the already-partitioned column. This causes an expensive redundant shuffle, since mapPartitions uses planWithBarrier. I wonder if there is a not-too-hacky solution for this in the Catalyst API?

Code example:

import org.apache.spark.sql.catalyst.encoders.RowEncoder
import spark.implicits._ // for the $"..." column syntax

val resultDF = keysDF
    .select("key1") // non-unique
    .join(mappingTable.select("key1", "key2"), "key1") // key1 -> key2 is many-to-one
    .repartition($"key2")
    .sortWithinPartitions($"key1", $"key2")
    .mapPartitions(appendThreeColumns)(RowEncoder(outputSchema))
    .select("key1", "key2", "value1", "value2", "value3")

As you can see, resultDF is partitioned by key1 (mind the many-to-one relationship) and internally sorted.

However, resultDF.groupBy("key1").agg(count("value1")), for example, will cause an Exchange.
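The redundant shuffle shows up directly in the physical plan; a minimal check, assuming the snippet above:

    import org.apache.spark.sql.functions.count

    // The plan contains an Exchange hashpartitioning(key1, ...) step even
    // though the data was already co-partitioned by key1 before mapPartitions.
    resultDF.groupBy("key1").agg(count("value1")).explain()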

Any advice is welcome.

shay__
  • I think resultDF is partitioned by key2 not key1 because you do `.repartition($"key2")`. – astro_asz Aug 01 '19 at 09:21
  • @astro_asz No, "mind the many-to-one relationship" – shay__ Aug 01 '19 at 09:43
  • What does `.repartition($"key2")` do? – astro_asz Aug 01 '19 at 10:09
  • I'm not sure what you mean – shay__ Aug 01 '19 at 10:37
  • Sorry, just trying to help. Try `resultDF.groupBy("key1").agg(count("value1")).explain()` and look for `Exchange hashpartitioning` lines, which will tell you what your df is partitioned on. keysDF is partitioned on key1, and after the join it is still only partitioned on key1; then it is shuffled on key2 in repartition(key2) (after which it is no longer partitioned by key1), and then shuffled again on key1 for the groupBy. – astro_asz Aug 01 '19 at 11:19
  • @astro_asz I appreciate your help but you seem to miss the question - Spark does not even know about `repartition(key2)` because of the analysis barrier. Also, there is a many-to-one relationship between `key1` and `key2` so the output DataFrame is partitioned by both. – shay__ Aug 01 '19 at 11:28

1 Answer


I think you are creating a few more columns with the mapPartitions logic and then applying aggregate operations; because of this you are getting a lot of shuffles across multiple executors. Spark has a bucketing concept; please follow this link. Apply bucketing before mapPartitions, then try the aggregations after mapPartitions. I think it will reduce network I/O.
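A minimal sketch of the bucketing idea, assuming a SparkSession with a metastore available; the source DataFrame inputDF, the table name mapped_keys, and the bucket count of 200 are all made up for illustration:

    // Persist the data bucketed (and sorted) by the grouping key; Spark
    // records the bucketing metadata in the catalog, so a later
    // groupBy("key1") on the bucketed table can avoid a full shuffle.
    inputDF.write
        .bucketBy(200, "key1")      // bucket count is an arbitrary example
        .sortBy("key1", "key2")
        .saveAsTable("mapped_keys") // hypothetical table name

    // Reading the table back restores the bucketing information.
    val bucketed = spark.table("mapped_keys")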

Ravi
  • Thanks Ravi :) I guess the question was badly described because bucketing has very little to do with it. I will add more details – shay__ Aug 01 '19 at 07:18