I was trying to use connectedComponents()
from GraphFrames in PySpark to compute the connected components of a reasonably large graph with roughly 1.9M vertices and 450K edges.
edgeDF.printSchema()
root
|-- src: string (nullable = true)
|-- dst: string (nullable = true)
vertDF.printSchema()
root
|-- id: string (nullable = true)
vertDF.count()
1879806
edgeDF.count()
452196
import graphframes as gf
custGraph = gf.GraphFrame(vertDF, edgeDF)
comp = custGraph.connectedComponents()
The job has not finished even after 6 hours. I am running PySpark on a single Windows machine.
a. Is such a computation feasible in this setup?
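For scale context, a hedged baseline (plain Python, not Spark; the random graph below is only a stand-in for the real edge list): a single union-find pass over ~450K edges among ~1.9M vertices fits easily in memory and runs in seconds on one machine, so the data size itself is not the obstacle.

```python
import random

def connected_components(n_vertices, edges):
    """Union-find with path compression: label each vertex with its component root."""
    parent = list(range(n_vertices))

    def find(x):
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for src, dst in edges:
        parent[find(src)] = find(dst)
    return [find(v) for v in range(n_vertices)]

# Stand-in graph of the same size: ~1.9M vertices, ~450K random edges.
random.seed(0)
n = 1_879_806
edges = [(random.randrange(n), random.randrange(n)) for _ in range(452_196)]
labels = connected_components(n, edges)
print(len(set(labels)))  # number of connected components in the random graph
```

This says nothing about the distributed algorithm's behavior, only that the graph is small by cluster standards.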
b. I was getting warning messages like the following:
[rdd_73_2, rdd_90_2]
[Stage 21:=========> (2 + 2) / 4][Stage 22:> (0 + 2) / 4]16/10/13 01:28:42 WARN Executor: 2 block locks were not released by TID = 632:
[rdd_73_0, rdd_90_0]
[Stage 21:=============> (3 + 1) / 4][Stage 22:> (0 + 3) / 4]16/10/13 01:28:43 WARN Executor: 2 block locks were not released by TID = 633:
[rdd_73_1, rdd_90_1]
[Stage 37:> (0 + 4) / 4][Stage 38:> (0 + 0) / 4]16/10/13 01:28:47 WARN Executor: 3 block locks were not released by TID = 844:
[rdd_90_0, rdd_104_0, rdd_107_0]
What do these warnings mean?
c. How can we specify that the graph is undirected in GraphFrames? Do we need to add edges in both directions?
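To illustrate the second half of (c): if an algorithm did respect edge direction, one way to force undirected semantics would be to union the edge list with a src/dst-swapped copy of itself before building the GraphFrame (in Spark, something like `edgeDF.union(edgeDF.selectExpr("dst as src", "src as dst")).distinct()`). A plain-Python sketch of the same idea, for clarity rather than as the GraphFrames API:

```python
def symmetrize(edges):
    """Make a directed edge list undirected: add the reverse of every
    edge and drop duplicates."""
    forward = {(s, d) for s, d in edges}
    backward = {(d, s) for s, d in edges}
    return sorted(forward | backward)

print(symmetrize([("a", "b"), ("b", "a"), ("b", "c")]))
# [('a', 'b'), ('b', 'a'), ('b', 'c'), ('c', 'b')]
```

Note this doubles the edge count (minus any pairs already present in both directions), which matters for runtime on a single machine.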