
Let's say I have an array of vertices and I want to create edges from them such that each vertex connects to the next x vertices, where x can be any integer value. Is there a way to do that with Spark?

This is what I have with Scala so far:

//array that holds the edges
var edges = Array.empty[Edge[Double]]
for (j <- 0 to vertices.size - 2) {
  for (i <- 1 to x) {
    if ((j + i) < vertices.size) {
      //add edge
      edges = edges ++ Array(Edge(vertices(j)._1, vertices(j + i)._1, 1.0))
      //add inverse edge, we want both directions
      edges = edges ++ Array(Edge(vertices(j + i)._1, vertices(j)._1, 1.0))
    }
  }
}

where the vertices variable is an array of (Long, String). But the whole process is of course sequential.

Edit:

For example, if I have the vertices Hello, World, and, Planet, cosmos, then I need the following edges: Hello -> World, World -> Hello, Hello -> and, and -> Hello, Hello -> Planet, Planet -> Hello, World -> and, and -> World, World -> Planet, Planet -> World, World -> cosmos, cosmos -> World, and so on.
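For reference, the nested loops above can be written as a single for-comprehension that builds the same edge array (still sequential, and assuming the vertices and x defined above):

import org.apache.spark.graphx.Edge

// Same logic as the nested loops: connect each vertex to its next x successors, in both directions
val edges: Array[Edge[Double]] = (for {
  j <- vertices.indices
  i <- 1 to x
  if j + i < vertices.length
  (src, dst) <- Seq((j, j + i), (j + i, j)) // forward and inverse edge
} yield Edge(vertices(src)._1, vertices(dst)._1, 1.0)).toArray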


2 Answers


Do you mean something like this?

// Add dummy vertices at the end (assumes that you don't use negative ids)
(vertices ++ Array.fill(n)((-1L, null))) 
  .sliding(n + 1) // Slide over n + 1 vertices at a time
  .flatMap(arr => { 
     val (srcId, _) = arr.head // Take the first vertex as the source
     // Generate 2n edges (forward and inverse)
     arr.tail.flatMap{case (dstId, _) => 
       Array(Edge(srcId, dstId, 1.0), Edge(dstId, srcId, 1.0))
     }}.filter(e => e.srcId != -1L && e.dstId != -1L)) // Drop dummies
  .toArray
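
As a quick local sanity check, the snippet can be wrapped in a small helper (the helper name and sample data below are just for illustration) and run on the example from the question with n = 3:

import org.apache.spark.graphx.Edge

// Illustrative wrapper around the snippet above
def slidingEdges(vertices: Array[(Long, String)], n: Int): Array[Edge[Double]] =
  (vertices ++ Array.fill(n)((-1L, null: String)))
    .sliding(n + 1)
    .flatMap { arr =>
      val (srcId, _) = arr.head
      arr.tail.flatMap { case (dstId, _) =>
        Array(Edge(srcId, dstId, 1.0), Edge(dstId, srcId, 1.0))
      }.filter(e => e.srcId != -1L && e.dstId != -1L)
    }
    .toArray

val sample = Array((1L, "Hello"), (2L, "World"), (3L, "and"), (4L, "Planet"), (5L, "cosmos"))
slidingEdges(sample, 3).foreach(println)
// Edge(1,2,1.0), Edge(2,1,1.0), Edge(1,3,1.0), Edge(3,1,1.0), Edge(1,4,1.0), Edge(4,1,1.0), ...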

If you want to run it on an RDD you simply adjust the initial step like this:

import org.apache.spark.mllib.rdd.RDDFunctions._

// Index of the last partition
val nPartitions = vertices.partitions.size - 1

vertices.mapPartitionsWithIndex((i, iter) =>
  if (i == nPartitions) (iter ++ Array.fill(n)((-1L, null))).toIterator
  else iter)

and of course drop the toArray. If you want circular connections (tail connected to head) you can replace Array.fill(n)((-1L, null)) with vertices.take(n) and drop the filter.
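
Putting the pieces together, a full RDD version might look roughly like this (a sketch assuming sliding from mllib's RDDFunctions; the function and variable names are just for illustration):

import org.apache.spark.graphx.Edge
import org.apache.spark.mllib.rdd.RDDFunctions._
import org.apache.spark.rdd.RDD

def slidingEdgesRDD(vertices: RDD[(Long, String)], n: Int): RDD[Edge[Double]] = {
  // Index of the last partition
  val lastPartition = vertices.partitions.size - 1

  // Pad only the last partition with n dummy vertices so every real vertex gets a full window
  val padded = vertices.mapPartitionsWithIndex { (i, iter) =>
    if (i == lastPartition) iter ++ Iterator.fill(n)((-1L, null: String))
    else iter
  }

  padded
    .sliding(n + 1)
    .flatMap { arr =>
      val (srcId, _) = arr.head
      arr.tail.flatMap { case (dstId, _) =>
        Array(Edge(srcId, dstId, 1.0), Edge(dstId, srcId, 1.0))
      }.filter(e => e.srcId != -1L && e.dstId != -1L)
    }
}

// e.g. val edges = slidingEdgesRDD(verticesRDD, 3)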

zero323

So, I think this will get you what you want:

First off, I define a little helper function (note that I have set the edge data here to the vertex names, so it's easier to inspect visually):

def pairwiseEdges(list: List[(Long, String)]): List[Edge[String]] = {
  list match {
    case x :: xs =>
      xs.flatMap(i => List(
        Edge(x._1, i._1, x._2 + "--" + i._2),
        Edge(i._1, x._1, i._2 + "--" + x._2)
      )) ++ pairwiseEdges(xs)
    case Nil => List.empty
  }
}
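
For instance, calling the helper on three of the question's vertices gives both directions for every pair (an illustrative call, output shown as a comment):

pairwiseEdges(List((1L, "hello"), (2L, "world"), (3L, "and")))
// List(Edge(1,2,hello--world), Edge(2,1,world--hello),
//      Edge(1,3,hello--and), Edge(3,1,and--hello),
//      Edge(2,3,world--and), Edge(3,2,and--world))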

I do a zipWithIndex on your array to get a key, and then convert the array to an RDD:

val vertices = List((1L,"hello"), (2L,"world"), (3L,"and"), (4L, "planet"), (5L,"cosmos")).toArray
val indexedVertices = vertices.zipWithIndex
val rdd = sc.parallelize(indexedVertices)

And then to generate the edges with x=3:

val edges = rdd
  .flatMap { case ((vertexId, name), index) =>
    for { i <- 0 to 3; if (index - i) >= 0 } yield (index - i, (vertexId, name))
  }
  .groupByKey()
  .flatMap { case (index, iterable) => pairwiseEdges(iterable.toList) }
  .distinct()

EDIT: Rewrote the flatMap and removed the filter as suggested by @zero323 in the comments.
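
To inspect the result, the edge RDD can be collected and printed locally (a small sketch, not part of the original code):

edges.collect().sortBy(e => (e.srcId, e.dstId)).foreach(println)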

This will generate the following output:

Edge(1,2,hello--world)
Edge(1,3,hello--and)
Edge(1,4,hello--planet)

Edge(2,1,world--hello)
Edge(2,3,world--and)
Edge(2,4,world--planet)
Edge(2,5,world--cosmos)

Edge(3,1,and--hello)
Edge(3,2,and--world)
Edge(3,4,and--planet)
Edge(3,5,and--cosmos)

Edge(4,1,planet--hello)
Edge(4,2,planet--world)
Edge(4,3,planet--and)
Edge(4,5,planet--cosmos)

Edge(5,2,cosmos--world)
Edge(5,3,cosmos--and)
Edge(5,4,cosmos--planet)
Glennie Helles Sindholt
  • I hope you don't mind a few suggestions: 1) A for comprehension can cover both the first flatMap and the filter: `for {i <- 0 to 3; if (index - i) >= 0} yield ((index - i, (vertexId, name)))` without any need for a mutable data structure, 2) If you decide to shuffle, then partitioning with `RangePartitioner` could be a good idea. It requires an additional pass over the data, but most of the tuples should already be on the right partition, 3) Arguably you could zipWithIndex on the RDD, but if the data fits in a local array it probably doesn't make sense. Not that processing with an RDD makes sense in such a case, but if the OP asks.. :) – zero323 Oct 25 '15 at 21:47
  • 4) `ListBuffer` is `GenTraversableOnce` so there is no need for `toList` – zero323 Oct 25 '15 at 21:51
  • @zero323 don't mind the suggestions at all - on the contrary :) I will adjust the code. And btw, I agree that it seems odd to use RDDs if the data fits in an array, but I had fun seeing if it was in fact possible to generate the edges using an RDD :) – Glennie Helles Sindholt Oct 26 '15 at 05:57