I want to check whether a new graph (call it A) is a subgraph of another graph (call it B). I wrote a small demo to test this, but it failed. I ran the demo in spark-shell, Spark version 1.6.1:

import org.apache.spark.graphx._

// Build graph B
val usersB = sc.parallelize(Array(
  (3L, ("rxin", "student")),
  (7L, ("jgonzal","postdoc")),
  (5L, ("franklin", "prof")),
  (2L, ("istoica", "prof"))
))

val relationshipsB = sc.parallelize(Array(
  Edge(3L, 7L, "collab"),
  Edge(5L, 3L, "advisor"),
  Edge(2L, 5L, "colleague"),
  Edge(5L, 7L, "pi")
))

val defaultUser = ("John Doe", "Missing")

val graphB = Graph(usersB, relationshipsB, defaultUser)

// Build the initial Graph A
val usersA = sc.parallelize(Array(
  (3L, ("rxin", "student")),
  (7L, ("jgonzal", "postdoc")),
  (5L, ("franklin", "prof"))
))

val relationshipsA = sc.parallelize(Array(
  Edge(3L, 7L, "collab"),
  Edge(5L, 3L, "advisor")
))

val testGraphA = Graph(usersA, relationshipsA, defaultUser)

// Apply the mask: keep only what graph A shares with graph B
val maskResult = testGraphA.mask(graphB)
maskResult.edges.count
maskResult.vertices.count

As I understand the API documentation on the Spark website, the mask function should return all the edges and vertices that the two graphs have in common. However, only the vertex count is correct (maskResult.vertices.count = 3); the edge count should be 2, but it is 0 (maskResult.edges.count = 0).

David Griffin
D.Eric

1 Answer

If you go look at the source, you'll see that mask uses EdgeRDD.innerJoin. If you go look at the documentation for innerJoin, you will see the caveat:

Inner joins this EdgeRDD with another EdgeRDD, assuming both are partitioned using the same PartitionStrategy.
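To see why that caveat can produce an edge count of 0, here is a toy model in plain Scala collections. It is not GraphX code: `partitionWiseJoin` and the edge placements are hypothetical, chosen only to illustrate a join that compares partition i on one side with partition i on the other and nothing else.

```scala
// Toy model (NOT GraphX code): an inner join that only compares
// partition i of `left` with partition i of `right`.
def partitionWiseJoin(
    left: Vector[Set[(Long, Long)]],
    right: Vector[Set[(Long, Long)]]): Set[(Long, Long)] =
  left.zip(right).flatMap { case (l, r) => l intersect r }.toSet

// Hypothetical placement of graph A's two edges over two partitions.
val edgesA = Vector(Set((3L, 7L)), Set((5L, 3L)))

// Graph B's four edges placed by a different rule: the shared edges
// end up in the "other" partition, so no pair ever meets.
val edgesBMisaligned = Vector(Set((5L, 3L), (2L, 5L)), Set((3L, 7L), (5L, 7L)))

// Graph B's edges placed by the same rule as graph A's.
val edgesBAligned = Vector(Set((3L, 7L), (2L, 5L)), Set((5L, 3L), (5L, 7L)))

val misalignedMatches = partitionWiseJoin(edgesA, edgesBMisaligned).size // 0
val alignedMatches = partitionWiseJoin(edgesA, edgesBAligned).size       // 2
```

In this model both edges of A exist in B either way; whether the join finds them depends only on whether equal edges are placed in the same partition index on both sides.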

You are going to need to create and use a PartitionStrategy, and apply the same one to both graphs. If you do the following, you will get the result you want (but it will probably not scale very well):

object MyPartStrat extends PartitionStrategy {
  override def getPartition(s: VertexId, d: VertexId, n: PartitionID) : PartitionID = {
    1     // this is just to prove the point, you'll need a real partition strategy
  }
}

Then if you do:

val maskResult = testGraphA.partitionBy(MyPartStrat).mask(graphB.partitionBy(MyPartStrat))

You will get the result you want. But like I said, you probably need to figure out a better partitioning strategy than just stuffing everything into one partition.
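As a minimal sketch of a less degenerate setup, the same idea can be tried with one of the strategies that ship with GraphX. This assumes spark-shell on Spark 1.6.x (so `sc` already exists); the choice of `PartitionStrategy.EdgePartition2D` and of 4 partitions is arbitrary and only for illustration. What matters is that both graphs use the same strategy and the same partition count.

```scala
// Sketch for spark-shell (Spark 1.6.x assumed); `sc` is the shell's SparkContext.
import org.apache.spark.graphx._

val defaultUser = ("John Doe", "Missing")

// Graph B: the larger graph from the question.
val graphB = Graph(
  sc.parallelize(Seq(
    (3L, ("rxin", "student")),
    (7L, ("jgonzal", "postdoc")),
    (5L, ("franklin", "prof")),
    (2L, ("istoica", "prof")))),
  sc.parallelize(Seq(
    Edge(3L, 7L, "collab"),
    Edge(5L, 3L, "advisor"),
    Edge(2L, 5L, "colleague"),
    Edge(5L, 7L, "pi"))),
  defaultUser)

// Graph A: the candidate subgraph from the question.
val testGraphA = Graph(
  sc.parallelize(Seq(
    (3L, ("rxin", "student")),
    (7L, ("jgonzal", "postdoc")),
    (5L, ("franklin", "prof")))),
  sc.parallelize(Seq(
    Edge(3L, 7L, "collab"),
    Edge(5L, 3L, "advisor"))),
  defaultUser)

// Assumption: any built-in strategy works, as long as BOTH graphs use
// the same strategy and the same number of partitions.
val strategy = PartitionStrategy.EdgePartition2D
val numParts = 4 // arbitrary illustrative value

val partitionedA = testGraphA.partitionBy(strategy, numParts)
val partitionedB = graphB.partitionBy(strategy, numParts)
val maskResult = partitionedA.mask(partitionedB)

val vertexCount = maskResult.vertices.count
val edgeCount = maskResult.edges.count
```

With both graphs partitioned identically, the edge count should come out as the 2 the question expects, alongside the 3 vertices. Passing the partition count explicitly avoids relying on both edge RDDs happening to have the same default number of partitions.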

David Griffin
    Nice answer. I would just add that he can choose one of the built-in partition strategies, which can be found [here](http://spark.apache.org/docs/1.5.1/api/scala/index.html#org.apache.spark.graphx.PartitionStrategy$). So maybe he doesn't need to create one at all; he could use something like `testGraphA.partitionBy(PartitionStrategy.CanonicalRandomVertexCut)` – Daniel de Paula May 17 '16 at 22:22
    Nice, will add to my answer later – David Griffin May 17 '16 at 22:43