I have the following code, which gives me the nodes for GraphX:
scala> val idNode = cleanwords.flatMap(x=>x).distinct.zipWithIndex.map{case (k, v) => (k, v.toLong)}
idNode: org.apache.spark.rdd.RDD[(String, Long)] = MapPartitionsRDD[185] at map at <console>:32
scala> idNode.take(5)
res97: Array[(String, Long)] = Array((cyber crimes,0), (cyber security,1), (india,2), (review,3), (civil society,4))
I built the edge list like this:
scala> val edgeList = cleanwords.map(_.combinations(2).toArray).flatMap(x=>x)
edgeList: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[187] at flatMap at <console>:32
scala> edgeList.take(3)
res98: Array[Array[String]] = Array(Array(cyber crimes, cyber security), Array(cyber crimes, review), Array(cyber crimes, india))
GraphX takes Long IDs to create a graph. How should I map the strings in edgeList to the IDs in idNode to create a graph in GraphX?
What I tried:
First I tried the following, based on the link, but I get the error below:
scala> val t1 = idNode.collect.toMap
t1: scala.collection.immutable.Map[String,Long] = Map(cyber crimes-> 0, cyber security -> 1, india -> 2, review -> ...
scala> val edges: RDD[(VertexId, VertexId)] = edgeList.map{case Array(x: String, y: String) => (t1.values(x), t1.values(y))}
<console>:48: error: Iterable[Long] does not take parameters
val edges: RDD[(VertexId, VertexId)] = edgeList.map{case Array(x: String, y: String) => (t1.values(x), t1.values(y))}
^
<console>:48: error: Iterable[Long] does not take parameters
val edges: RDD[(VertexId, VertexId)] = edgeList.map{case Array(x: String, y: String) => (t1.values(x), t1.values(y))}
As I read the error, there are duplicate values for a key, which is why I am getting an Iterable, but I have already run distinct on this. How do I get rid of them?
Also, the above solution does not scale to larger datasets because it uses collect.
Then I tried another alternative:
val edges2: RDD[(VertexId, VertexId)] = edgeList.map{case Array(x: String, y: String) => (idNode.filter(_._1 == x).collect.toMap.values, idNode.filter(_._1 == y).collect.toMap.values)}
This does not work either.
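One likely reason is that idNode is itself an RDD, and Spark does not allow one RDD to be used inside another RDD's transformation. Lookups like this are usually expressed as joins, which also avoids collect. Below is a minimal sketch under that assumption; the object name, the toIdEdges helper, the sample data, and the edge attribute 1 are all hypothetical and only for illustration.

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object JoinEdgesSketch {
  // Replace each keyword in an edge with its numeric id using two joins,
  // so the id table never has to be collected to the driver.
  def toIdEdges(idNode: RDD[(String, Long)],
                edgeList: RDD[Array[String]]): RDD[(VertexId, VertexId)] =
    edgeList
      .map { case Array(src, dst) => (src, dst) }        // key by source keyword
      .join(idNode)                                      // (src, (dst, srcId))
      .map { case (_, (dst, srcId)) => (dst, srcId) }    // re-key by destination keyword
      .join(idNode)                                      // (dst, (srcId, dstId))
      .map { case (_, (srcId, dstId)) => (srcId, dstId) }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("join-edges-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical sample data shaped like cleanwords in the question
    val cleanwords = sc.parallelize(Seq(
      Array("cyber crimes", "cyber security", "india"),
      Array("review", "india")))

    // zipWithIndex already yields Long ids, so no extra toLong is needed
    val idNode: RDD[(String, Long)] = cleanwords.flatMap(x => x).distinct.zipWithIndex
    val edgeList: RDD[Array[String]] = cleanwords.flatMap(_.combinations(2))

    val vertices: RDD[(VertexId, String)] = idNode.map(_.swap)
    val edges = toIdEdges(idNode, edgeList).map { case (s, d) => Edge(s, d, 1) }
    val graph = Graph(vertices, edges)

    println(s"vertices=${graph.numVertices}, edges=${graph.numEdges}")
    spark.stop()
  }
}
```

Because these are inner joins, an edge whose keyword is missing from idNode would be dropped silently rather than raising an error.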
Can anyone please suggest how I should build these nodes and edges to create a graph in GraphX? I am using Spark version 2.1.0.
Update
I found a fix for the non-scalable approach:
scala> val edges: RDD[(VertexId, VertexId)] = edgeList.map{case Array(x: String, y: String) => (t1(x), t1(y))}
Use t1(x) instead of t1.values(x) to solve the error.
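The error was not about duplicate keys: t1.values returns an Iterable[Long] holding every id, and an Iterable cannot be applied to a key, whereas t1(x) calls Map.apply and looks up a single key. A minimal plain-Scala sketch of the difference, with a hypothetical object name and hard-coded stand-ins for t1 and edgeList:

```scala
object MapLookupSketch {
  // Hypothetical stand-ins for t1 and a few edges from the question
  val t1: Map[String, Long] = Map(
    "cyber crimes" -> 0L, "cyber security" -> 1L, "india" -> 2L, "review" -> 3L)
  val edgeList: Seq[Array[String]] = Seq(
    Array("cyber crimes", "cyber security"),
    Array("cyber crimes", "review"))

  // t1.values is an Iterable[Long]; it has no apply(key), so t1.values(x)
  // does not compile. t1(x) is Map.apply and returns the id for one key.
  val edges: Seq[(Long, Long)] =
    edgeList.map { case Array(x, y) => (t1(x), t1(y)) }

  // t1.get returns an Option, so unknown keywords are skipped instead of
  // throwing NoSuchElementException as t1(x) would.
  val safeEdges: Seq[(Long, Long)] =
    edgeList.flatMap { case Array(x, y) =>
      for (a <- t1.get(x); b <- t1.get(y)) yield (a, b)
    }

  def main(args: Array[String]): Unit = {
    println(edges)
    println(safeEdges)
  }
}
```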