How to split RDD of (String, Array[String]) into RDD of (String, String) for each item in array?

Question

I have a pair RDD of type RDD[(String, Array[String])]. I want to flatten the values so that I get an RDD[(String, String)] in which each element of the Array[String] in the first RDD becomes its own element in the second RDD, paired with the original key.

For instance, my first RDD has the following elements:

("a", Array("x", "y"))
("b", Array("y", "z"))

The result I want is this:

("a", "x")
("a", "y")
("b", "y")
("b", "z")

How can I do this? flatMapValues(f: Array[String] => TraversableOnce[String]) seems to be the right choice here, but what should I pass as the argument f?

@kaktusito Right thanks; I've updated the question because I was actually looking for the argument to pass into flatMapValues(). You've made that clean. — Carsten, Sep 03 '15 at 18:40
@Carsten I would use `identity` instead of `x => x`. The scala compiler is probably clever enough to realize that that's `identity` but maybe not and then you create a new object. — 2rs2ts, Sep 03 '15 at 18:41
Is there any difference using this instead: `rdd.flatMap{ case (a,b) => b.map(a->_) }` ? Does `flatMapValues` do anything different ? — tuxdna, Sep 04 '15 at 07:47
@tuxdna There's a performance reason, I believe. `flatMap` is not guaranteed to keep the partitioner of the original rdd (since there's no way to check that the keys will remain the same), while `flatMapValues` will. This is important when doing operations that require shuffling, as joins. — ale64bit, Sep 04 '15 at 11:09
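
The partitioner point in the last comment can be checked with a small sketch. It assumes Spark is on the classpath and runs in local mode; the object name `PartitionerCheck`, the helper `partitioners`, and the choice of `HashPartitioner(2)` are illustrative and not from the thread. The array is converted with `toSeq` so the example does not rely on an implicit Array conversion.

```scala
import org.apache.spark.{HashPartitioner, Partitioner, SparkConf, SparkContext}

object PartitionerCheck {
  // Returns the partitioner seen after flatMapValues and after flatMap (hypothetical helper).
  def partitioners(sc: SparkContext): (Option[Partitioner], Option[Partitioner]) = {
    // Give the input an explicit partitioner so there is something to preserve.
    val rdd = sc
      .parallelize(Seq(("a", Array("x", "y")), ("b", Array("y", "z"))))
      .partitionBy(new HashPartitioner(2))

    // Keys cannot change here, so Spark keeps the partitioner.
    val viaFlatMapValues = rdd.flatMapValues(_.toSeq)
    // Keys could change here, so Spark drops the partitioner.
    val viaFlatMap = rdd.flatMap { case (k, vs) => vs.toSeq.map(v => (k, v)) }

    (viaFlatMapValues.partitioner, viaFlatMap.partitioner)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("partitioner-check").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (kept, dropped) = partitioners(sc)
      println(s"flatMapValues keeps a partitioner: ${kept.isDefined}")
      println(s"flatMap keeps a partitioner: ${dropped.isDefined}")
    } finally sc.stop()
  }
}
```

Both versions produce the same pairs; the difference only matters when a later step, such as a join on the same keys, can reuse the existing partitioning instead of shuffling again.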

score 4 · Accepted Answer · edited Sep 04 '15 at 10:14

To achieve the desired result, do:

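import org.apache.spark.rdd.RDD
val rdd1: RDD[(Any, Array[Any])] = ...
// identity returns each array unchanged; flatMapValues then emits one (key, element) pair per array item
val rddFlat: RDD[(Any, Any)] = rdd1.flatMapValues(identity[Array[Any]])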

The result looks like the one asked for in the question.
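
For the concrete String case from the question, a self-contained sketch might look like the following. It assumes Spark is on the classpath and runs in local mode; the object name `FlattenValues` and the helper `flatten` are made up for illustration. It uses an explicit `toSeq` rather than `identity`, which is a deliberate substitution to avoid depending on the implicit Array conversion.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object FlattenValues {
  // One (key, element) pair per array item; keys are left untouched (hypothetical helper).
  def flatten(rdd: RDD[(String, Array[String])]): RDD[(String, String)] =
    rdd.flatMapValues(_.toSeq)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("flatten-values").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Sample data taken from the question.
      val rdd1 = sc.parallelize(Seq(("a", Array("x", "y")), ("b", Array("y", "z"))))
      // Prints (a,x), (a,y), (b,y), (b,z), one per line.
      flatten(rdd1).collect().foreach(println)
    } finally sc.stop()
  }
}
```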

edited Sep 04 '15 at 10:14

Jacek Laskowski


answered Sep 03 '15 at 18:51

Carsten


protip: It should be a Wiki answer instead since you simply gathered the comments. – Jacek Laskowski Sep 04 '15 at 10:15
