I am using GraphFrames LPA (label propagation) to find communities, but it is not giving me the expected result.

graph_data = spark.createDataFrame([
  ("a", "d", "friend"),
  ("b", "d", "friend"),
  ("c", "d", "friend")
], ["src", "dst", "relationship"])

My requirement is to get a single community ID for all vertices a, b, c and d, but I am getting two different community IDs: one for a, b and c, and one for d. Code:

from graphframes import GraphFrame

# Build the vertex list from every distinct src and dst id
df1 = graph_data.selectExpr('src AS id')
df2 = graph_data.selectExpr('dst AS id')
vertices = df1.union(df2).distinct()

edges = graph_data
g = GraphFrame(vertices, edges)

# Run label propagation for 5 iterations
communities = g.labelPropagation(maxIter=5)
Simon Long

1 Answer

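Because d is the root (every edge points to it), label propagation gives it a separate label from a, b and c. Label propagation is also not guaranteed to converge, so running more iterations will not necessarily merge them. To get a single label for all four vertices, I recommend using connected components instead; see the GraphFrames docs.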

communities = g.connectedComponents()

Note: this requires that you set a checkpoint directory beforehand.

sc.setCheckpointDir("some_path")
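
To see why the two approaches differ on this graph, here is a small plain-Python sketch (no Spark needed). It is not the GraphFrames implementation. It assumes that label propagation updates all vertices at once, treats edges as undirected, and breaks ties by picking the smallest label; the function names are made up for illustration.

```python
from collections import Counter

# Undirected view of the question's edges (a-d, b-d, c-d).
EDGES = [("a", "d"), ("b", "d"), ("c", "d")]

def neighbors(edges):
    """Build an undirected adjacency map from (src, dst) pairs."""
    adj = {}
    for src, dst in edges:
        adj.setdefault(src, []).append(dst)
        adj.setdefault(dst, []).append(src)
    return adj

def label_propagation(edges, max_iter):
    """Synchronous label propagation; ties go to the smallest label (assumption)."""
    adj = neighbors(edges)
    labels = {v: v for v in adj}  # every vertex starts with its own label
    for _ in range(max_iter):
        new_labels = {}
        for v, nbrs in adj.items():
            counts = Counter(labels[n] for n in nbrs)
            best = max(counts.values())
            new_labels[v] = min(lab for lab, c in counts.items() if c == best)
        labels = new_labels  # all vertices switch at the same time
    return labels

def connected_components(edges):
    """Label each vertex with the smallest vertex id in its component (union-find)."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for src, dst in edges:
        ra, rb = find(src), find(dst)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # keep the smaller id as root
    return {v: find(v) for v in parent}

lpa = label_propagation(EDGES, max_iter=5)
cc = connected_components(EDGES)
print("label propagation:   ", lpa)
print("connected components:", cc)
```

Under these assumptions the leaves (a, b, c) and the hub (d) swap labels on every round, so the sketch always ends with two labels no matter how large max_iter is, while connected components assigns one label to all four vertices.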