
I have a list of pairs (in Scala) that I parallelize into an RDD.

val seqRDD = sc.parallelize(Seq(("a","b"),("b","c"),("c","a"),("d","b"),("e","c"),("f","b"),("g","a"),("h","g"),("i","e"),("j","m"),("k","b"),("l","m"),("m","j"))) 

I group by the second element to compute a particular statistic and flatten the result into one list.

val checkItOut = seqRDD.groupBy(each => (each._2))
                   .map(each => each._2.toList)
                   .collect
                   .flatten
                   .toList

The output looks like this:

checkItOut: List[(String, String)] = List((c,a), (g,a), (a,b), (d,b), (f,b), (k,b), (m,j), (b,c), (e,c), (i,e), (j,m), (l,m), (h,g))

Now, what I'm trying to do is "group" all elements (not pairs) that are connected to each other through any pair into one list. For example: c is with a in one pair and a is with g in another, so (a, c, g) are connected. Then c is also paired with b and e, b is paired with a, d, f and k, and those appear with other characters in further pairs. I want each such connected set of elements in its own list.

I know this can be done with a BFS traversal, but I'm wondering if there is an API in Spark that does this?

Anoop Dixith
  • You're looking for GraphX connectedComponents. – Traian Feb 22 '17 at 04:56
  • You can do this with `groupWith` - something I wrote for another question. It adds elements to a group if the predicate matches for any existing member of that group, which is what you need here: http://stackoverflow.com/a/35919875/21755 – The Archetypal Paul Feb 22 '17 at 08:14

1 Answer


Use GraphX's connected components: http://spark.apache.org/docs/latest/graphx-programming-guide.html#connected-components
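
For the data in the question, a minimal sketch could look like the one below. It assumes the set of distinct labels is small enough to collect into a driver-side map (labelToId here), because GraphX requires Long vertex ids; the variable names are illustrative, not part of any API.

import org.apache.spark.graphx.{Edge, Graph}

// GraphX needs Long vertex ids, so assign one to each distinct label
val labels = seqRDD.flatMap { case (s, d) => Seq(s, d) }.distinct().zipWithIndex()
val labelToId = labels.collectAsMap()            // assumes the label set is small

val vertices = labels.map { case (label, id) => (id, label) }
val edges = seqRDD.map { case (s, d) => Edge(labelToId(s), labelToId(d), 1) }
val graph = Graph(vertices, edges)

// connectedComponents tags every vertex with the smallest vertex id in its
// component; group the original labels by that tag to get the desired lists
val grouped = graph.connectedComponents().vertices
  .join(vertices)                                // (id, (componentId, label))
  .map { case (_, (componentId, label)) => (componentId, label) }
  .groupByKey()
  .map { case (_, members) => members.toList.sorted }

grouped.collect().foreach(println)
// e.g. List(a, b, c, d, e, f, g, h, i, k) and List(j, l, m)

Note that connectedComponents ignores edge direction when deciding membership (it effectively treats the graph as undirected), which matches the grouping described in the question.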

Traian