I have:

A_RDD = anRDD.map()

B_RDD = A_RDD.aggregateByKey()

My question is:

If I put partitionBy(new HashPartitioner) after A_RDD, like this:

A_RDD = anRDD.map().partitionBy(new HashPartitioner(2))

B_RDD = A_RDD.aggregateByKey()

1) Will this be as efficient as leaving it as it was in the first example? aggregateByKey() will use that HashPartitioner from A_RDD, right?

2) Or, if I leave it as in the first example, will aggregateByKey() aggregate every partition by key first, and then send each "aggregated" (key, value) pair to the right partition in a more efficient way?

3) Why can't map, flatMap and other transformations on RDDs take an argument saying how to partition the (key, value) pairs on the fly? What I mean is that, for example, during the map() operation each tuple would also be sent to a specific partition designated by a partitioner argument on map, e.g. map( , Partitioner).

I am trying to grasp how aggregateByKey() works, but every time I think I've got it, a new question arises... Thanks in advance.

Spar

1 Answer

  • If you put partitionBy before aggregateByKey, it will typically be less efficient than aggregateByKey alone. partitionBy shuffles every raw record before anything is aggregated, so you effectively disable map-side combine.
  • If you leave it as in the first example, there will be a map-side combine, which is typically more efficient.
  • Non-shuffling operations don't take a partitioner because there is no data movement. They are performed locally on each machine.
  • Thanks for the answer. I have one final question about your last point, though. If there is no data movement or shuffling in map(), why does everybody advise using mapValues() instead wherever possible? I know that map() can **change** the keys, so I guessed that it does a shuffle or a repartitioning. If you change the keys with a map() operation on a (key, value) pair RDD, doesn't that mean you change the partitioning? E.g. A_RDD.partitionBy(new HashPartitioner(2)).map(): how is it going to partition the keys? – Spar Dec 10 '16 at 21:41
  • No, it means you lose the partitioning information, which is not the same thing. –  Dec 10 '16 at 22:08
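
To make the map-side combine point from the answer concrete, here is a rough plain-Scala model (no Spark dependency) that counts how many records would cross the shuffle in each of the two variants from the question. It is only a sketch and not Spark's implementation: the object ShuffleModel, its methods and the sample data are all made up for illustration, each inner Seq stands in for one input partition, and the aggregation is assumed to be a simple sum per key.

```scala
// Plain-Scala model of the two variants; NOT Spark's implementation.
// ShuffleModel, its methods and the sample data are hypothetical.
object ShuffleModel {
  type Pair = (String, Int)

  // Sum the values per key within a single partition.
  def combineLocally(partition: Seq[Pair]): Map[String, Int] =
    partition.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

  // Models partitionBy followed by aggregateByKey: every raw record is
  // shuffled first, and aggregation only happens after the shuffle.
  def withoutCombine(input: Seq[Seq[Pair]]): (Int, Map[String, Int]) = {
    val shuffled = input.map(_.size).sum
    (shuffled, combineLocally(input.flatten))
  }

  // Models aggregateByKey alone: each input partition pre-aggregates,
  // so it sends at most one record per distinct key across the shuffle.
  def withCombine(input: Seq[Seq[Pair]]): (Int, Map[String, Int]) = {
    val partials = input.map(combineLocally)
    val shuffled = partials.map(_.size).sum
    val merged = partials.flatten
      .groupBy(_._1)
      .map { case (k, vs) => k -> vs.map(_._2).sum }
    (shuffled, merged)
  }

  // Two input partitions with heavily repeated keys.
  val sample: Seq[Seq[Pair]] = Seq(
    Seq("a" -> 1, "a" -> 2, "b" -> 3),
    Seq("a" -> 4, "b" -> 5, "b" -> 6)
  )

  def main(args: Array[String]): Unit = {
    val (rawCount, rawResult) = withoutCombine(sample)
    val (combinedCount, combinedResult) = withCombine(sample)
    println(s"partitionBy first: $rawCount records shuffled, result $rawResult")
    println(s"aggregateByKey alone: $combinedCount records shuffled, result $combinedResult")
  }
}
```

On this sample, both variants produce the same sums per key, but the first moves 6 records and the second only 4. The gap grows with the number of repeated keys per partition, which is why the pre-partitioned version is typically the slower one.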