I have a weighted graph with ~340k vertices and ~772k edges. I build an edge RDD and a vertex RDD from a file on HDFS.
val verticesRDD : RDD[(VertexId, Long)]
val edgesRDD : RDD[Edge[Double]]
From these RDDs I create a graph using the Graph.apply method.
val my_graph: Graph[Long, Double] = Graph.apply(verticesRDD, edgesRDD)
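For concreteness, here is a minimal, self-contained sketch of this construction that also prints how many partitions the resulting graph has. The toy vertex and edge data, the `local[2]` master, and the names `BuildGraphSketch` and `build` are assumptions made only so the sketch can run without a cluster or an HDFS file; they are not part of my actual job.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

object BuildGraphSketch {
  // Hypothetical helper: builds a tiny graph the same way as described above.
  def build(sc: SparkContext): Graph[Long, Double] = {
    // Toy stand-ins for the RDDs that would normally be parsed from HDFS.
    val verticesRDD: RDD[(VertexId, Long)] =
      sc.parallelize(Seq((1L, 10L), (2L, 20L), (3L, 30L)))
    val edgesRDD: RDD[Edge[Double]] =
      sc.parallelize(Seq(Edge(1L, 2L, 0.5), Edge(2L, 3L, 1.5)))

    Graph.apply(verticesRDD, edgesRDD)
  }

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption so the sketch runs on a single machine.
    val conf = new SparkConf().setAppName("BuildGraphSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)

    val my_graph = build(sc)

    // The edge partition count is inherited from edgesRDD; Graph.apply does
    // not repartition the edges.
    println(s"edge partitions:   ${my_graph.edges.partitions.length}")
    println(s"vertex partitions: ${my_graph.vertices.partitions.length}")

    sc.stop()
  }
}
```

Printing the partition counts like this is a quick way to check whether the graph is actually spread across enough partitions to keep several executors busy.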
I then run a shortest-path algorithm for a range of inputs. This works well on a single node. However, when I run in cluster mode with multiple nodes, I see neither a speed-up nor increased hardware utilisation.
Reading the documentation, I see that "GraphX provides several ways of building a graph from a collection of vertices and edges in an RDD or on disk. None of the graph builders repartitions the graph’s edges by default; instead, edges are left in their default partitions (such as their original blocks in HDFS)."
Thus, it makes sense that I am not seeing a speed-up, as the edges are left in their original default partitions from HDFS.
I then tried the partitionBy(PartitionStrategy.RandomVertexCut) method, but this does not appear to help with repartitioning the edges.
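As a minimal sketch of that attempt: `Graph.partitionBy` also has an overload that takes an explicit number of partitions, and the edge RDD can be repartitioned with the ordinary `RDD.repartition` before it is handed to `Graph.apply`. The toy data, the `local[2]` master, the partition count of 8, and the names `PartitionSketch` and `build` are assumptions for illustration only; I have not confirmed that this is the recommended approach.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, PartitionStrategy, VertexId}
import org.apache.spark.rdd.RDD

object PartitionSketch {
  // Hypothetical helper: builds a toy graph whose edges are spread over
  // numParts partitions.
  def build(sc: SparkContext, numParts: Int): Graph[Long, Double] = {
    val verticesRDD: RDD[(VertexId, Long)] =
      sc.parallelize(Seq((1L, 10L), (2L, 20L), (3L, 30L), (4L, 40L)))
    val edgesRDD: RDD[Edge[Double]] =
      sc.parallelize(Seq(Edge(1L, 2L, 0.5), Edge(2L, 3L, 1.5), Edge(3L, 4L, 2.5)))

    // Option 1: shuffle the edge RDD into numParts partitions before
    // constructing the graph.
    val repartitionedEdges = edgesRDD.repartition(numParts)
    val graph = Graph.apply(verticesRDD, repartitionedEdges)

    // Option 2: let GraphX place the edges according to a strategy, with an
    // explicit partition count instead of the current one.
    graph.partitionBy(PartitionStrategy.RandomVertexCut, numParts)
  }

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption so the sketch runs on a single machine.
    val conf = new SparkConf().setAppName("PartitionSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)

    val graph = build(sc, numParts = 8)
    println(s"edge partitions: ${graph.edges.partitions.length}")

    sc.stop()
  }
}
```

Calling `partitionBy` with only a strategy keeps the existing number of edge partitions, so if the input file produced just a few partitions, the graph will still have only a few after that call.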
I found there is a minEdgePartitions argument for constructing a graph using the fromEdgeTuples method.
How do I partition edges with the Graph.apply constructor method?