
I have a problem running GraphX

val adjGraph = adjGraph_CC.vertices
   // mapMsgGen generates a list of msgs; each msg has the form K -> V
   .flatMap { case (id, (compID, adjSet)) => mapMsgGen(id, compID, adjSet) }
   // mapMsgMerg merges the two msgs passed to it
   .reduceByKey((fst, snd) => mapMsgMerg(fst, snd))
   .collect

What I expected reduceByKey to do is group the entire output of flatMap by key (K) and then process the list of values (Vs) for each key (K) using the function provided.

What actually happens is that each output of flatMap (produced by mapMsgGen), which is a list of K->V pairs (usually not all with the same K), is processed immediately by the reduceByKey function mapMsgMerg, before the whole flatMap has finished.

I would appreciate some clarification. I don't understand what is going wrong. Or am I misunderstanding how flatMap and reduceByKey work?

Regards,

Maher

Sachin Janani

1 Answer


There's no need to produce the entire output of flatMap before starting reduceByKey. In fact, if you're not using the intermediate output of flatMap, it's better not to materialize it at all, which may save some memory.

If your flatMap outputs a list that contains 'k' -> v1 and 'k' -> v2, there's no reason to wait until the entire list has been produced before passing v1 and v2 to mapMsgMerg. As soon as those two tuples are output, v1 and v2 can be combined as mapMsgMerg(v1, v2), and v1 and v2 can be discarded if the intermediate list isn't used.
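To make the point concrete, here is a minimal sketch in plain Scala collections (not Spark itself) that compares the "group everything, then reduce" model from the question with merging each message as soon as it is produced. The names genMsgs and merge are hypothetical stand-ins for mapMsgGen and mapMsgMerg, and the sketch assumes the merge function is associative and commutative, which is what reduceByKey expects of its function.

```scala
import scala.collection.mutable

object IncrementalReduce {
  // Hypothetical stand-in for mapMsgGen: each vertex id emits two (key, value) messages.
  def genMsgs(id: Int): List[(String, Int)] =
    List((if (id % 2 == 0) "even" else "odd", id), ("all", id))

  // Hypothetical stand-in for mapMsgMerg: assumed associative and commutative.
  def merge(a: Int, b: Int): Int = a + b

  // Model from the question: build every message, group by key, then reduce each group.
  def grouped(ids: Seq[Int]): Map[String, Int] =
    ids.flatMap(id => genMsgs(id))
      .groupBy(_._1)
      .map { case (k, kvs) => k -> kvs.map(_._2).reduce((a, b) => merge(a, b)) }

  // Pipelined model: fold each message into a per-key accumulator as soon as it appears.
  // The iterator is lazy, so the full list of messages is never materialized.
  def incremental(ids: Seq[Int]): Map[String, Int] = {
    val acc = mutable.Map.empty[String, Int]
    ids.iterator.flatMap(id => genMsgs(id)).foreach { case (k, v) =>
      acc(k) = acc.get(k).map(merge(_, v)).getOrElse(v)
    }
    acc.toMap
  }

  def main(args: Array[String]): Unit = {
    val ids = 1 to 4
    println(grouped(ids))
    println(incremental(ids))
  }
}
```

Both functions return the same map for the same input, so the order in which the merge function is called does not change the result as long as the function is associative and commutative. If mapMsgMerg depends on seeing all values for a key at once, groupByKey is probably closer to the expected behavior, at the cost of holding each key's values in memory.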

I don't know the details of the Spark scheduler well enough to say whether this is guaranteed behavior, but it does look like an instance of what the original Spark paper calls 'pipelining' of operations.

mrmcgreg
  • Hi, thanks for the answer, it is clearer now. So what is really happening is that the reduceByKey function (mapMsgMerg) is applied as soon as the function in flatMap (mapMsgGen) has produced two outputs with the same key. – Maher Turifi Dec 26 '14 at 00:14