
So, say I have the following transformations:

myRDD = someRDD.map(...)

mySecondRDD = myRDD.aggregateByKey(initValue)(CombOp , MergeOp)

At this point myRDD doesn't have a partitioner, but mySecondRDD has a HashPartitioner. First, I want to ask:

1) Do I have to designate a partitioner in myRDD? And if I do, how is it possible to pass it as an argument to aggregateByKey?

*Note that myRDD is the result of a transformation and has no partitioner.

2) At the end of these two commands, shouldn't myRDD have the same partitioner as mySecondRDD instead of none?

3) How many shuffles will these two commands trigger?

4) If I designate a partitioner with partitionBy in myRDD and manage to pass it as an argument to aggregateByKey, will I have reduced the shuffles to 1 instead of 2?

I am sorry, I still don't quite get how this works.

Spar

1 Answer


I will try to answer your questions:

  1. You don't have to assign a partitioner explicitly; in the code you provided, Spark will assign one automatically. If an RDD does not have a partitioner, a default HashPartitioner is used. Take a look at the Spark documentation for more details. To specify your own partitioner, use the version of aggregateByKey() that accepts a partitioner alongside the initial value. It will look like myRdd.aggregateByKey(initialValue, partitioner)(CombOp, MergeOp); see the sketch after this list.

  2. Your mySecondRDD will use the partitioner from myRDD if myRDD already has one and you do not explicitly specify a new partitioner in aggregateByKey().

  3. You will have only 1 shuffle operation, since the map() transformation will not trigger one. By contrast, aggregateByKey() needs to bring records with the same key onto one machine.

  4. You will have only one shuffle even if you leave the code as it is.
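To make point 1 concrete, here is a minimal, hypothetical sketch of the two aggregateByKey() variants. The data, the zero value 0, and the _ + _ functions are placeholders standing in for initValue, CombOp and MergeOp from the question, and an existing SparkContext sc is assumed:

import org.apache.spark.HashPartitioner

// Hypothetical pair RDD of (word, count) records; it has no partitioner yet.
val someRDD = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Variant from the question: Spark picks a HashPartitioner itself.
val defaultAgg = someRDD.aggregateByKey(0)(_ + _, _ + _)

// Overload that takes a Partitioner alongside the initial value.
val customAgg =
  someRDD.aggregateByKey(0, new HashPartitioner(4))(_ + _, _ + _)

// The input has no partitioner; the result does.
println(someRDD.partitioner)   // None
println(customAgg.partitioner) // Some(org.apache.spark.HashPartitioner@...)

With the second variant, a downstream keyed operation that uses the same partitioner can reuse the existing layout instead of shuffling again.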

Anton Okolnychyi
  • Thanks for the answer, but I still have some questions about this: 1) You said you will have only 1 shuffle operation since the map() transformation will not trigger it. But aggregateByKey, which triggers the shuffle, needs map() to be calculated first, because aggregateByKey steps on that partitioning. And map() makes a shuffle anyway (it doesn't keep the keys). So that is 1 shuffle from map() and 1 from aggregateByKey, right? – Spar Dec 10 '16 at 07:36
  • @Spartan I am not sure I understood your comment correctly. Can you provide the DAG of stages from the Spark UI? It will tell us how many shuffles you have. I still believe that `map()` is a one-to-one transformation that does not require a shuffle. You can find which transformations cause shuffles [here](http://spark.apache.org/docs/latest/programming-guide.html#shuffle-operations). – Anton Okolnychyi Dec 10 '16 at 10:31
  • @AntonOkolnychyi Well, I am trying to understand how Spark works. From what I have read, map() on a (key, value) RDD doesn't keep the keys, and it doesn't keep the partitioning either! So that means it repartitions the data, right? But repartitioning means shuffling! So map() is 1 shuffle and aggregateByKey another? You can check my new question here too: http://stackoverflow.com/questions/41074276/aggregatebykey-partitioning. Also, I am trying to understand why the partitioner is None after a map() or mapValues() on an RDD. Thanks – Spar Dec 10 '16 at 10:57
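For reference, a small experiment (not from the thread; it assumes an existing SparkContext sc) that separates the two things debated in these comments: map() discards the partitioner because it is allowed to change keys, but that is only metadata being dropped, not a shuffle, whereas mapValues() keeps the partitioner:

import org.apache.spark.HashPartitioner

val pairs = sc
  .parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
  .partitionBy(new HashPartitioner(4)) // one shuffle happens here

// map() could rewrite keys, so Spark forgets the partitioner.
// No data moves; only the Option[Partitioner] metadata is cleared.
println(pairs.map { case (k, v) => (k, v + 1) }.partitioner) // None

// mapValues() provably keeps keys, so the partitioner survives.
println(pairs.mapValues(_ + 1).partitioner) // Some(...HashPartitioner...)

So in the question's code there is exactly one shuffle, triggered by aggregateByKey(); using mapValues() where possible is what lets Spark carry partitioner information across such a step.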