
I am trying to use connectedComponents() from GraphFrames in PySpark to compute the connected components of a reasonably large graph with roughly 1.9 million vertices and 450K edges.

edgeDF.printSchema()
root
 |-- src: string (nullable = true)
 |-- dst: string (nullable = true)


vertDF.printSchema()
root
 |-- id: string (nullable = true)

vertDF.count()
1879806

edgeDF.count()
452196

import graphframes as gf

custGraph = gf.GraphFrame(vertDF, edgeDF)

comp = custGraph.connectedComponents()
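
For completeness, here is a minimal sketch of the whole run; the checkpoint directory (which newer GraphFrames versions need for connectedComponents()) and the optional algorithm argument are illustrative assumptions, not part of my original code:

# Newer GraphFrames versions require a checkpoint directory for
# connectedComponents(); the path below is only an example.
sc.setCheckpointDir("C:/tmp/spark-checkpoints")

custGraph = gf.GraphFrame(vertDF, edgeDF)

# If the installed version supports it, the older GraphX-based
# algorithm can be selected explicitly instead of the default.
comp = custGraph.connectedComponents(algorithm="graphx")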

The job has still not finished after 6 hours. I am running PySpark on a single Windows machine.

a. Is it feasible to do such a computation with this setup?

b. I was getting warning messages like the following:

[rdd_73_2, rdd_90_2]
[Stage 21:=========>        (2 + 2) / 4][Stage 22:>                 (0 + 2) / 4]16/10/13 01:28:42 WARN Executor: 2 block locks were not released by TID = 632:

[rdd_73_0, rdd_90_0]
[Stage 21:=============>    (3 + 1) / 4][Stage 22:>                 (0 + 3) / 4]16/10/13 01:28:43 WARN Executor: 2 block locks were not released by TID = 633:

[rdd_73_1, rdd_90_1]
[Stage 37:>                 (0 + 4) / 4][Stage 38:>                 (0 + 0) / 4]16/10/13 01:28:47 WARN Executor: 3 block locks were not released by TID = 844:

[rdd_90_0, rdd_104_0, rdd_107_0]

What does this mean?

c. How can we specify that the graph is undirected in GraphFrames? Do we need to add the edges in both directions?
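
For reference, this is the kind of symmetrization I mean by "adding edges in both directions"; it is only a sketch against my src/dst schema, not something I have benchmarked:

from pyspark.sql import functions as F

# Add the reverse of every edge so each undirected edge is represented
# by two directed edges, then drop any duplicates this creates.
reversedDF = edgeDF.select(F.col("dst").alias("src"), F.col("src").alias("dst"))
undirectedEdgeDF = edgeDF.union(reversedDF).dropDuplicates(["src", "dst"])

undirectedGraph = gf.GraphFrame(vertDF, undirectedEdgeDF)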

vikruj
  • Doesn't connected components automatically treat the graph as undirected? I don't think you need to worry about (c). – Nick Chammas Oct 18 '16 at 02:50
  • Regarding (b), you may want to follow this issue on the GraphFrames tracker: https://github.com/graphframes/graphframes/issues/116 – Nick Chammas Oct 19 '16 at 20:54

0 Answers