
I am struggling to understand how to create the following in GraphX in Apache Spark. I am given:

an HDFS file with lots of data in the form:

node: ConnectingNode1, ConnectingNode2..

For example:

123214: 521345, 235213, 657323

I need to somehow store this data in an EdgeRDD so that I can create my graph in GraphX, but I have no idea how to go about this.


1 Answer


After you read your HDFS source and have your data in an RDD, you can try something like the following:

import org.apache.spark.rdd.RDD
import org.apache.spark.graphx.Edge
// Sample data
val rdd = sc.parallelize(Seq("1: 1, 2, 3", "2: 2, 3"))

val edges: RDD[Edge[Int]] = rdd.flatMap {
  row =>
    // split around ":"
    val parts = row.split(":").map(_.trim)
    // the value to the left of ":" is the source vertex:
    val srcVertex = parts(0).toLong
    // for the values to the right of ":", we split around "," to get the other vertices
    val otherVertices = parts(1).split(",").map(_.trim)
    // for each vertex to the right of ":", we create an Edge object connecting it to the srcVertex:
    otherVertices.map(v => Edge(srcVertex, v.toLong, 1))
}
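If some lines of the HDFS file may be malformed (blank lines, headers, or stray text), a more defensive variant can skip them instead of failing with a `NumberFormatException`. This is only a sketch, not part of the original answer; `parseLine` is a hypothetical helper name:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.graphx.Edge

// Hypothetical helper: parse one "src: dst1, dst2, ..." line,
// returning no edges for anything that is not purely numeric.
def parseLine(row: String): Seq[Edge[Int]] = {
  row.split(":").map(_.trim) match {
    case Array(src, dsts) if src.nonEmpty && src.forall(_.isDigit) =>
      dsts.split(",").map(_.trim)
        .filter(d => d.nonEmpty && d.forall(_.isDigit))
        .map(d => Edge(src.toLong, d.toLong, 1))
    case _ => Seq.empty // skip malformed lines instead of throwing
  }
}

val safeEdges: RDD[Edge[Int]] = rdd.flatMap(parseLine)
```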

Edit

Additionally, if a constant default attribute is fine for all your vertices, you can create your graph straight from the edges, so you don't need to build a separate vertices RDD:

import org.apache.spark.graphx.Graph
val g = Graph.fromEdges(edges, defaultValue = 1)
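To sanity-check the result (assuming the `g` value from above), you can look at the counts and a few triplets; this check is an addition, not part of the original answer:

```scala
// Basic sanity checks on the constructed graph.
// numVertices counts the distinct vertex ids appearing in the edges,
// since fromEdges derives the vertex set from the edge endpoints.
println(g.numVertices)
println(g.numEdges)

// Inspect a few edges with their endpoints:
g.triplets.take(5).foreach(t => println(s"${t.srcId} -> ${t.dstId}"))
```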
Daniel de Paula
  • thanks for all of your help !! i followed what you said and was able to create a val graph, just trying to find a way to see if it worked ! – Rhys Copperthwaite Dec 16 '16 at 20:47
  • I tried doing it the way you said, only thing that didn't work was `RDD[Edge[Int]]` so i just used RDD. but keep getting the following errors: `:43: error: not found: value Edge otherVertices.map(v => Edge(srcVertex, v.toLong, 1))` and `:43: error: type mismatch; found : Array[Nothing] required: TraversableOnce[?] otherVertices.map(v => Edge(srcVertex, v.toLong, 1))` – Rhys Copperthwaite Dec 17 '16 at 10:27
  • Did you import the Edge class? `import org.apache.spark.graphx.Edge`. That's probably the problem and also why `RDD[Edge[Int]]` didn't work – Daniel de Paula Dec 17 '16 at 10:32
  • @RhysCopperthwaite – Daniel de Paula Dec 17 '16 at 10:34
  • thanks ill try that now !, and when i use edge[Int] it gives me the following error... :40: error: value rdd of type org.apache.spark.rdd.RDD[String] does not take type parameters. val edges = rdd[Edge[Int]].flatMap { – Rhys Copperthwaite Dec 17 '16 at 10:46
  • @RhysCopperthwaite I'm sorry, my bad, there was a typo in the code. I edited it now. Please notice that `RDD[Edge[Int]]` is a type, so the code should be `val edges: RDD[Edge[Int]] = rdd.flatMap {...}` – Daniel de Paula Dec 17 '16 at 10:47
  • Awesome it works now !! thank you !! when i run the code it creates ..... edges: org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[Int]] = MapPartitionsRDD[1] at flatMap at :35 which is fine. but when i try to edges.take(5) or edges.count() to see if the data was created it gives me loads of errors like: 16/12/17 10:51:40 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, moon02.eecs.qmul.ac.uk): java.lang.NumberFormatException: For input string: "hdfs" loads like this, not much errors with the actual code! – Rhys Copperthwaite Dec 17 '16 at 10:57
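A note on that final `NumberFormatException`: the message `For input string: "hdfs"` suggests the RDD contains the path string itself rather than the file's lines, which happens if the path is passed to `sc.parallelize` instead of `sc.textFile`. This is a guess from the error message, not something confirmed in the thread; the path below is a placeholder:

```scala
// Read the file's lines (one RDD element per line), not the path string.
// "hdfs:///path/to/data" is a placeholder for the actual file location.
val rdd = sc.textFile("hdfs:///path/to/data")
```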