
I'm trying to identify strongly connected communities within a large group (an undirected weighted graph), or alternatively, to identify the vertices that connect sub-groups (communities) which would otherwise be unrelated.

The problem is part of a broader Databricks solution, so Spark GraphX and GraphFrames are the first choice for solving it.

As you can see from the attached picture, I need to find vertex "X" as the point where the big continuous group identified by the connected components algorithm (val result = g.connectedComponents.run()) can be split.

The strongly connected components method (for directed graphs only), triangle counting, and LPA community detection algorithms are not suitable, even if all weights are the same, e.g. 1.

Picture of the point where the big group ST0 should be cut

Similar logic is nicely described in the question "Cut in a Weighted Undirected Connected Graph", but only as a mathematical expression.

Thanks for any hint.

import org.graphframes.GraphFrame

// Vertex DataFrame
val v = sqlContext.createDataFrame(List( 
  (1L, "A-1", 1),       // "St-1"
  (2L, "B-1", 1),
  (3L, "C-1", 1),
  (4L, "D-1", 1),

  (5L, "G-2", 1),      // "St-2"
  (6L, "H-2", 1),
  (7L, "I-2", 1),
  (8L, "J-2", 1),  
  (9L, "K-2", 1),

  (10L, "E-3", 1),     // St-3
  (11L, "F-3", 1),
  (12L, "Z-3", 1),

  (13L, "X-0", 1)      // split point
)).toDF("id", "name", "myGrp")

// Edge DataFrame
val e = sqlContext.createDataFrame(List( 
  (1L, 2L, 1),
  (1L, 3L, 1),
  (1L, 4L, 1),
  (1L, 13L, 5),  // critical edge
  (2L, 4L, 1),

  (5L, 6L, 1),
  (5L, 7L, 1),
  (5L, 13L, 7),   // critical edge
  (6L, 9L, 1),    
  (6L, 8L, 1),  
  (7L, 8L, 1),   

  (12L, 10L, 1),
  (12L, 11L, 1),
  (12L, 13L, 9),  // critical edge
  (10L, 11L, 1)
)).toDF("src", "dst", "relationship")

val g = GraphFrame(v, e)
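
For reference, the connectedComponents call mentioned above can be run against this example graph roughly as follows (a sketch only; GraphFrames' connected components needs a checkpoint directory, the path below is just a placeholder, and sc is assumed to be the SparkContext as in a Databricks notebook):

sc.setCheckpointDir("/tmp/graphframes-cc-checkpoint")   // placeholder path

val result = g.connectedComponents.run()   // vertex DataFrame with an extra "component" column
result.select("id", "name", "component").show()
// all 13 vertices land in a single component, which is exactly why vertex "X" needs to be found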
  • Interesting question! Could you elaborate on why "Triangle counting, or LPA community detection algorithms are not suitable"? From the sketch that you attached a triangle or loop count would do the trick, wouldn't it? – JanLauGe May 29 '20 at 14:18
  • @JanLauGe, you're right, triangle counting would narrow the options. There would be 0 triangles for X. However, you'd get 0 for C and K too. Now imagine there were additional nodes connected to C or K. Do you see any way to leverage triangle count in such a case? – Dan Jun 19 '20 at 19:43
  • 1
    In the example K and C are terminal vertices. If that is by design and not just coincidence we could cut only edges of non-terminal nodes without triangles. As you correctly point out though, if there are additional nodes connected to C and K this doesn't "cut it" anymore (see what I did there?)... Depending on the actual data, perhaps a ratio of triangle count to degree centrality might be helpful? – JanLauGe Jul 16 '20 at 11:15
  • @JanLauGe Good point! The absence of a triangle on a non-terminal node identifies suspicious vertices. There might be some risk for clusters with missing edges (e.g. if there were no B - D edge, A would be flagged exactly the same as X). That means some additional method would be needed, but your idea helps! – Dan Jul 21 '20 at 21:43
  • @Palo did you ever write this code? I would love to be able to reference it if you wouldn't mind posting the result? Thanks! – John Smith Jul 08 '21 at 15:54

1 Answer


Betweenness centrality seems to be one of the algorithms fitting this problem. This method counts how many shortest paths pass through each vertex, out of all shortest paths connecting any pair of other vertices.
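
For reference (my notation, not part of the original post), this is the standard betweenness centrality definition being approximated:

C_B(v) = \sum_{s \ne v \ne t} \sigma_{st}(v) / \sigma_{st}

where \sigma_{st} is the number of shortest paths between s and t, and \sigma_{st}(v) is the number of those paths that pass through v.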

As far as I know, GraphFrames does not have betweenness centrality, and its Shortest Paths method just provides the number of hops between vertices without listing the actual path. Using the bfs (Breadth First Search) method can give us a reasonable approximation (note: bfs doesn't reflect distance/edge length either, and it treats the graph as directed):

  • Ensure each edge is defined in both directions so that bfs treats the graph as undirected
  • Declare a mutable structure (e.g. ArrayBuffer) pathMembers with the following fields [fromId, toId, pathId, vertexId]
  • For each vertex o in your graph g.vertices (outer loop)
    • For each vertex i in your graph g.vertices.filter($"id" < lit(o.id)) (inner loop - looks only at i.id smaller than o.id, because shortestPath(o.id, i.id) is exactly the same as shortestPath(i.id, o.id) in an undirected graph)
      • apply val paths = g.bfs.fromExpr("id = " + o.id).toExpr("id = " + i.id).run()
      • transpose paths so that every vertex of each path is stored as an individual row in pathMembers
  • Calculate how many times each vertexId was present in each fromId, toId pair (i.e. vertexId count divided by pathId count for each fromId, toId pair)
  • Sum up the calculation for each vertexId to obtain the betweenness centrality measure (a sketch of these steps follows below)
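
A rough sketch of those steps in Scala is below. It is only an illustration, assuming the small example graph above, a SparkSession available as spark, and that collecting all vertex ids to the driver is acceptable; the nested loop issues one bfs call per vertex pair, so it will not scale to a really large graph as-is.

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.functions._
import org.graphframes.GraphFrame

import spark.implicits._   // assumes a SparkSession named "spark"

// 1) declare every edge in both directions so bfs effectively sees an undirected graph
val eUndir = e.union(e.select($"dst".as("src"), $"src".as("dst"), $"relationship"))
val gU = GraphFrame(v, eUndir)

// vertex ids collected to the driver - fine for this small example graph
val ids = v.select("id").collect().map(_.getLong(0)).sorted

// [fromId, toId, pathId, vertexId]
val pathMembers = ArrayBuffer[(Long, Long, Long, Long)]()

for (o <- ids; i <- ids if i < o) {   // each unordered pair exactly once
  val paths = gU.bfs.fromExpr(s"id = $o").toExpr(s"id = $i").run()
  // bfs returns one row per shortest path; intermediate vertices sit in
  // struct columns named v1, v2, ... between "from" and "to"
  val vCols = paths.columns.filter(_.matches("v\\d+"))
  if (vCols.nonEmpty) {
    val rows = paths.select(vCols.map(c => col(c + ".id")): _*).collect()
    rows.zipWithIndex.foreach { case (row, p) =>
      (0 until row.length).foreach { k =>
        // endpoints are skipped - betweenness only counts paths between pairs of other vertices
        pathMembers += ((o, i, p.toLong, row.getLong(k)))
      }
    }
  }
}

// share of a pair's shortest paths passing through each vertex, summed over all pairs
val pm = pathMembers.toSeq.toDF("fromId", "toId", "pathId", "vertexId")
val pathsPerPair = pm.groupBy("fromId", "toId").agg(countDistinct("pathId").as("nPaths"))

val betweenness = pm.groupBy("fromId", "toId", "vertexId")
  .agg(countDistinct("pathId").as("nThrough"))
  .join(pathsPerPair, Seq("fromId", "toId"))
  .withColumn("share", $"nThrough" / $"nPaths")
  .groupBy("vertexId")
  .agg(sum("share").as("betweenness"))
  .orderBy(desc("betweenness"))

betweenness.show()   // vertex 13 ("X-0") should come out on top for the example graph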

Vertex "X" for the schema will get highest value. Value for vertices directly connected to "X" will drop. Difference will be highes if most of the groups cross-connected by "X" have comparable size.

Note: if your graph is so large that the full betweenness centrality algorithm would take prohibitively long, a sub-set of pairs for the shortest path calculation could be selected randomly. The sample size is a compromise between acceptable processing time and the probability of picking most pairs from a single branch of the graph.
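
Continuing the sketch above, a minimal way to sample pairs instead of enumerating all of them could look like this (sampleFraction is a hypothetical knob, not part of the original answer):

// replace the exhaustive loop over all pairs with a random sample of pairs
val sampleFraction = 0.05   // hypothetical knob: trade accuracy for run time
val allPairs = for (o <- ids; i <- ids if i < o) yield (o, i)
val sampledPairs = scala.util.Random.shuffle(allPairs.toList)
  .take(math.max(1, (allPairs.size * sampleFraction).toInt))
// ...then run the same per-pair bfs accumulation over sampledPairs instead of all pairs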
