2

In the https://docs.databricks.com/spark/latest/graph-analysis/graphframes/user-guide-scala.html standard example:

  • The Strongly Connected Components seem reasonable computationally when looking also at them visually on a drawing.

enter image description here

  • Therefore I am surprised at the Label Propagation for detecting "communities".

enter image description here

What am I missing? [A, D, E] I would have thought would be a community as well from the data and that results would be similar. I tried with more cycles. Label Propagation seems the poor cousin of "clustering".

thebluephantom
  • 16,458
  • 8
  • 40
  • 83

1 Answers1

2

What you have found here is a known phenomenon called label oscillation which occurs when the labels are synchronously updated and (sub)graphs have a bipartite structure (or star graphs). The two communities will endlessly exchange their labels and the LPA will never terminate by itself.

Labels at t: oscillation at t

Labels at t+1: oscillation at t+1

Labels at t+2: oscillation at t

... and so on.

As you have already mentioned this is not really what we would expect from a community detection algorithm as there are no edges within the communities. But this is still a fast algorithm which delivers good results for non bipartite structures. Raghavan proposed a fast alternative which uses asynchronous updating. But this is not yet implemented in Graphframes. Graphframes calls the graphX implementation of LPA (have a look at code code) which uses Pregel (have a look at code code) which is synchronous.

Besides of LPA problem for bipartite structures there is also a general difference between SCC and LPA:

  • SCC community: Every node of a community knows (has an edge) all other nodes within the community.
  • LPA community: For each node of a community there is a path (sequence of edges) within the community to all other nodes within a community.
cronoik
  • 15,434
  • 3
  • 40
  • 78