0

I have a Spark RDD of type (Array[breeze.linalg.DenseVector[Double]], breeze.linalg.DenseVector[Double]). I wish to flatten its key to transform it into an RDD of type (breeze.linalg.DenseVector[Double], breeze.linalg.DenseVector[Double]). I am currently doing:

val newRDD = oldRDD.flatMap(ob => anonymousOrdering(ob))

The signature of anonymousOrdering() is String => (Array[DenseVector[Double]], DenseVector[Double]).

This produces a type mismatch error: required: TraversableOnce[?]. The Python code doing the same thing is:

newRDD = oldRDD.flatMap(lambda point: [(tile, point) for tile in anonymousOrdering(point)])

How can I do the same thing in Scala? I generally use flatMapValues, but here I need to flatten the key.

Armand Grillet
  • Could you specify the signature of `anonymousOrdering`? Also after flattening the type of the RDD is the same in your question. Is that intentional? – Gábor Bakos Aug 15 '16 at 18:20
  • Signature added (comment in the first snippet), my intention is to transform a RDD containing (Array(1, 2), 3) into a RDD containing (1, 3) | (2, 3). I have replaced the type DenseVector by an integer for this example. – Armand Grillet Aug 15 '16 at 18:23

3 Answers

2

If I understand your question correctly, you can do:

val newRDD = oldRDD.flatMap(ob => anonymousOrdering(ob))
// newRDD is RDD[(Array[DenseVector], DenseVector)]

In that case, you can "flatten" the Array portion of the tuple using pattern matching and a for/yield statement:

val flatRDD = newRDD.flatMap { case (a: Array[DenseVector[Double]], b: DenseVector[Double]) =>
  for (v <- a) yield (v, b)
}
// flatRDD is RDD[(DenseVector, DenseVector)]
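The flatten can be sanity-checked on plain Scala collections (a sketch without Spark; Seq stands in for the RDD and plain Doubles for the DenseVectors):

```scala
// Plain-collections sketch of the flatten above (no Spark required).
// Each (Array, value) pair becomes one output pair per array element.
val data = Seq((Array(1.0, 2.0), 3.0))
val flattened = data.flatMap { case (a, b) => for (v <- a) yield (v, b) }
// flattened == Seq((1.0, 3.0), (2.0, 3.0))
```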

Although it's still not clear to me where/how you want to use groupByKey.

Alfredo Gimenez
  • I am removing the groupByKey() at the end of the map as it is not related to the question. Thanks for the answer. – Armand Grillet Aug 15 '16 at 20:04
  • val newRDD = oldRDD.flatMap(ob => anonymousOrdering(ob)) returns `found : (Array[breeze.linalg.DenseVector[Double]], breeze.linalg.DenseVector[Double]), required: TraversableOnce[?]` – Armand Grillet Aug 15 '16 at 20:14
  • 1
    Looks like the issue is inside of `anonymousOrdering` then... See here: http://stackoverflow.com/questions/30833618/how-do-i-flatmap-a-row-of-arrays-into-multiple-rows – Alfredo Gimenez Aug 15 '16 at 20:36
  • The method used in the link is not available for RDDs, only DataFrames. – Armand Grillet Aug 15 '16 at 21:33
  • It was a link for reference, not a solution--I can't debug your `anonymousOrdering` function just by looking at the type signature and error. Your original question was about flattening keys, maybe re-accept this and ask a new question? – Alfredo Gimenez Aug 15 '16 at 21:48
  • Yes sorry, I will write a new question and the answer soon as this one was not clear enough. – Armand Grillet Aug 15 '16 at 21:50
0

Change the code to use map instead of flatMap:

val newRDD = oldRDD.map(ob => anonymousOrdering(ob)).groupByKey()

You would only want to use flatMap here if anonymousOrdering returned a list of tuples and you wanted it flattened down.
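On plain collections, the difference looks like this (a sketch; f is a hypothetical anonymousOrdering that already returns a list of tuples, with Ints standing in for the DenseVectors):

```scala
// map keeps the nesting; flatMap flattens one level.
// f stands in for an anonymousOrdering that returns a list of tuples.
val f = (p: (Array[Int], Int)) => p._1.toSeq.map(t => (t, p._2))
val points = Seq((Array(1, 2), 3))
val mapped = points.map(f)     // Seq(Seq((1, 3), (2, 3)))  -- still nested
val flat   = points.flatMap(f) // Seq((1, 3), (2, 3))       -- flattened
```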

0

As anonymousOrdering() is a function that you have in your code, update it so that it returns a Seq[(breeze.linalg.DenseVector[Double], breeze.linalg.DenseVector[Double])]. It is like doing [(tile, point) for tile in anonymousOrdering(point)] but directly at the end of the anonymous function. The flatMap will then emit one output element for each element of the returned sequences.
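A sketch of that reshaped function, using Int stand-ins for DenseVector as in the asker's comment (the function body is illustrative, since the real anonymousOrdering was never posted):

```scala
// Hypothetical reshaped anonymousOrdering: it builds the (tile, point)
// pairs itself, so flatMap can flatten the returned sequence directly.
def anonymousOrdering(ob: (Array[Int], Int)): Seq[(Int, Int)] = {
  val (tiles, point) = ob
  tiles.toSeq.map(tile => (tile, point))
}

val oldData = Seq((Array(1, 2), 3))
val newData = oldData.flatMap(anonymousOrdering)
// newData == Seq((1, 3), (2, 3))
```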

As a general rule, avoid having a collection as a key in an RDD.

Armand Grillet
  • You asked how to flatten a key, I answered and you accepted my answer, but then you did a workaround for your code so that you don't have to flatten the key anymore, and accepted your workaround as a "solution"... You also never posted the contents of `anonymousOrdering`, which had the actual problem. Bad form! – Alfredo Gimenez Aug 18 '16 at 19:08
  • I accepted your answer again but it did not work and the real answer is just to not have a RDD with arrays as keys. – Armand Grillet Aug 18 '16 at 19:25