I have a pair RDD of type RDD[(String, Array[String])]. I want to flatten the values so that I get an RDD[(String, String)] in which each element of the Array[String] in the first RDD becomes a separate element in the second RDD.

For instance, my first RDD has the following elements:

("a", Array("x", "y"))
("b", Array("y", "z"))

The result I want is this:

("a", "x")
("a", "y")
("b", "y")
("b", "z")

How can I do this? flatMapValues(f: Array[String] => TraversableOnce[String]) seems to be the right choice here, but what do I need to pass as the argument f?

Jacek Laskowski
Carsten
  • Just do `rdd.flatMapValues(x => x)` – ale64bit Sep 03 '15 at 18:32
  • @kaktusito Right thanks; I've updated the question because I was actually looking for the argument to pass into flatMapValues(). You've made that clean. – Carsten Sep 03 '15 at 18:40
  • @Carsten I would use `identity` instead of `x => x`. The scala compiler is probably clever enough to realize that that's `identity` but maybe not and then you create a new object. – 2rs2ts Sep 03 '15 at 18:41
  • Is there any difference using this instead: `rdd.flatMap{ case (a,b) => b.map(a->_) }`? Does `flatMapValues` do anything different? – tuxdna Sep 04 '15 at 07:47
  • @tuxdna There's a performance reason, I believe. `flatMap` is not guaranteed to keep the partitioner of the original RDD (since there's no way to check that the keys will remain the same), while `flatMapValues` will. This is important when doing operations that require shuffling, such as joins. – ale64bit Sep 04 '15 at 11:09
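
The partitioner point raised in the last comment can be checked with a small sketch like the one below. It assumes Spark is on the classpath and runs in local mode; the object name `PartitionerCheck`, the helper `partitionersKept`, the app name, and the choice of `HashPartitioner(2)` are all illustrative assumptions, not something from the thread.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PartitionerCheck {
  // Hypothetical helper: reports whether each result RDD still carries a partitioner.
  // Returns (flatMapValues keeps it, flatMap keeps it).
  def partitionersKept(sc: SparkContext): (Boolean, Boolean) = {
    // Give the input an explicit partitioner so there is something to preserve.
    val partitioned = sc
      .parallelize(Seq(("a", Array("x", "y")), ("b", Array("y", "z"))))
      .partitionBy(new HashPartitioner(2))

    // Only the values change, so Spark can keep the partitioner.
    val viaFlatMapValues = partitioned.flatMapValues(_.toSeq)

    // The keys could in principle change, so the partitioner is dropped.
    val viaFlatMap = partitioned.flatMap { case (k, vs) => vs.toSeq.map(k -> _) }

    (viaFlatMapValues.partitioner.isDefined, viaFlatMap.partitioner.isDefined)
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for this sketch.
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("partitioner-check"))
    try println(partitionersKept(sc))
    finally sc.stop()
  }
}
```

Both variants produce the same pairs; the difference only matters when a later step, such as a join, can reuse the existing partitioning and skip a shuffle.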

1 Answer

To achieve the desired result, do:

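import org.apache.spark.rdd.RDD
val rdd1: RDD[(Any, Array[Any])] = ...
// identity returns each array as-is; flatMapValues emits one (key, element) pair per array element
val rddFlat: RDD[(Any, Any)] = rdd1.flatMapValues(identity[Array[Any]])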

The result looks like the one asked for in the question.
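
For the concrete String types from the question, a self-contained sketch might look like the following. It assumes Spark is on the classpath and uses local mode; the object name `FlattenValuesExample`, the helper `flattenValues`, and the app name are made up for illustration. It passes `_.toSeq` rather than `identity` so the function's result type is explicit, which is a stylistic choice and not a requirement of the answer above.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object FlattenValuesExample {
  // Hypothetical helper: turns each (key, array) pair into one (key, element) pair per element.
  def flattenValues(rdd: RDD[(String, Array[String])]): RDD[(String, String)] =
    rdd.flatMapValues(_.toSeq)

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for this sketch.
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("flatten-values"))
    try {
      // The sample data from the question.
      val rdd1: RDD[(String, Array[String])] = sc.parallelize(Seq(
        ("a", Array("x", "y")),
        ("b", Array("y", "z"))
      ))
      // Print every flattened pair on its own line.
      flattenValues(rdd1).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Run against the sample data, this should yield the four pairs listed in the question.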

Jacek Laskowski
Carsten