
I have the following code, which gives me the nodes for GraphX:

scala> val idNode = cleanwords.flatMap(x=>x).distinct.zipWithIndex.map{case (k, v) => (k, v.toLong)}
idNode: org.apache.spark.rdd.RDD[(String, Long)] = MapPartitionsRDD[185] at map at <console>:32

scala> idNode.take(5)
res97: Array[(String, Long)] = Array((cyber crimes,0), (cyber security,1), (india,2), (review,3), (civil society,4))

I build the edge list like this:

scala> val edgeList = cleanwords.map(_.combinations(2).toArray).flatMap(x=>x)
edgeList: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[187] at flatMap at <console>:32

scala> edgeList.take(3)
res98: Array[Array[String]] = Array(Array(cyber crimes, cyber security), Array(cyber crimes, review), Array(cyber crimes, india))

GraphX needs Long ids to create a graph. How should I map the strings in edgeList to the ids in idNode so that I can create a graph in GraphX?


What I tried:

First I tried the following, based on the link, but I get the error below:

scala> val t1 = idNode.collect.toMap
t1: scala.collection.immutable.Map[String,Long] = Map(cyber crimes -> 0, cyber security -> 1, india -> 2, review -> ...

scala> val edges: RDD[(VertexId, VertexId)] = edgeList.map{case Array(x: String, y: String) => (t1.values(x), t1.values(y))}
<console>:48: error: Iterable[Long] does not take parameters
       val edges: RDD[(VertexId, VertexId)] = edgeList.map{case Array(x: String, y: String) => (t1.values(x), t1.values(y))}
                                                                                                                  ^
<console>:48: error: Iterable[Long] does not take parameters
       val edges: RDD[(VertexId, VertexId)] = edgeList.map{case Array(x: String, y: String) => (t1.values(x), t1.values(y))}

The error seems to suggest that there are duplicate values for a key, which would explain why I am getting an Iterable, but I have already run distinct on the data. How do I get rid of this error? Also, am I right that this approach will not scale to a larger dataset, since it uses collect?

Then another alternative:

val edges2: RDD[(VertexId, VertexId)] = edgeList.map{case Array(x: String, y: String) => (idNode.filter(_._1 == x).collect.toMap.values, idNode.filter(_._1 == y).collect.toMap.values)}

This does not work either.

Can anyone please suggest how I should build these nodes and edges to create a graph in GraphX? I am using Spark 2.1.0.
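For the scalable case, one possible direction is to avoid collect entirely and resolve the strings to ids with two joins against idNode: one for the source and one for the destination. The following is only a minimal sketch under assumptions, not a tested answer for this dataset: the object name JoinEdgesSketch, the helper buildGraph, the intermediate name pairs, the constant edge attribute 1, the local SparkSession and the tiny sample data are all hypothetical.

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object JoinEdgesSketch {

  // Hypothetical helper: resolves both endpoints of every edge by joining on the node name.
  def buildGraph(idNode: RDD[(String, Long)],
                 edgeList: RDD[Array[String]]): Graph[String, Int] = {
    // Turn each two-element array into a (src, dst) pair keyed by the source name.
    val pairs: RDD[(String, String)] = edgeList.map { case Array(x, y) => (x, y) }

    val edges: RDD[Edge[Int]] = pairs
      .join(idNode)                                   // (src, (dst, srcId))
      .map { case (_, (dst, srcId)) => (dst, srcId) } // re-key by the destination name
      .join(idNode)                                   // (dst, (srcId, dstId))
      .map { case (_, (srcId, dstId)) => Edge(srcId, dstId, 1) } // 1 is a placeholder attribute

    // GraphX expects vertices as (id, attribute), so swap (name, id).
    val vertices: RDD[(VertexId, String)] = idNode.map(_.swap)

    Graph(vertices, edges)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session with made-up sample data, only for illustration.
    val spark = SparkSession.builder().master("local[*]").appName("JoinEdgesSketch").getOrCreate()
    val sc = spark.sparkContext

    val cleanwords: RDD[Array[String]] = sc.parallelize(Seq(
      Array("cyber crimes", "cyber security", "india"),
      Array("india", "review")
    ))

    // Same construction as in the question.
    val idNode = cleanwords.flatMap(x => x).distinct.zipWithIndex.map { case (k, v) => (k, v.toLong) }
    val edgeList = cleanwords.map(_.combinations(2).toArray).flatMap(x => x)

    val graph = buildGraph(idNode, edgeList)
    graph.triplets.collect.foreach(t => println(s"${t.srcAttr} -> ${t.dstAttr}"))

    spark.stop()
  }
}
```

Because everything stays in RDDs, nothing is pulled to the driver. It also avoids calling idNode.filter inside edgeList.map as in the second attempt, since Spark does not support using one RDD inside another RDD's transformation.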


Update

I was able to find a fix for the non-scalable approach:

scala> val edges: RDD[(VertexId, VertexId)] = edgeList.map{case Array(x: String, y: String) => (t1(x), t1(y))} 

Using t1(x) instead of t1.values(x) solves the error.
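As far as I can tell, the error was never about duplicate keys: values returns an Iterable[Long] holding every value in the map, and an Iterable has no apply method, so it cannot be called with a key. A small plain-Scala sketch of the difference (no Spark needed; the object name MapLookupSketch and the sample entries are made up for illustration):

```scala
object MapLookupSketch {
  // A small stand-in for the collected idNode map.
  val t1: Map[String, Long] = Map("cyber crimes" -> 0L, "cyber security" -> 1L, "india" -> 2L)

  // values gives back all ids as an Iterable[Long]; writing t1.values("india")
  // fails to compile with "Iterable[Long] does not take parameters".
  val allIds: Iterable[Long] = t1.values

  // Map.apply looks up a single key; it throws NoSuchElementException if the key is absent.
  val indiaId: Long = t1("india")

  // get is the safe variant and returns an Option instead of throwing.
  val missing: Option[Long] = t1.get("review")
}
```

If some edge endpoints might be missing from the map, t1.get (or getOrElse with a default) avoids an exception inside the map over edgeList.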

analyticalpicasso