
I have a weighted graph with ~340k vertices and ~772k edges. I build an edge RDD and a vertex RDD from a file on HDFS.

val verticesRDD : RDD[(VertexId, Long)]

val edgesRDD : RDD[Edge[Double]]

From these RDDs I create a graph using the .apply method.

val my_graph: Graph[(Long),Double] = Graph.apply(verticesRDD, edgesRDD)

I then run a shortest-path algorithm for a range of inputs. This works well on a single node. However, when I run in cluster mode with multiple nodes, I see neither a speed-up nor increased hardware utilisation.

Reading the documentation, I see that "GraphX provides several ways of building a graph from a collection of vertices and edges in an RDD or on disk. None of the graph builders repartitions the graph’s edges by default; instead, edges are left in their default partitions (such as their original blocks in HDFS)."

Thus, it makes sense that I am not seeing a speed up as the edges are left in their original default partition, on HDFS.

I then tried the partitionBy(PartitionStrategy.RandomVertexCut) method, but this does not seem to help with repartitioning the edges.

I found there is a minEdgePartitions argument for constructing a graph using the fromEdgeTuples method.

How do I partition edges when using the Graph.apply constructor method?

LearningSlowly
  • Have you already tried **partitionBy(PartitionStrategy.EdgePartition2D)**? Edges are assigned to the partitions using both the source vertex and the destination vertex. – Umberto Griffo Dec 12 '16 at 09:21
  • All possible partitioning strategies can be found in the following link: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.PartitionStrategy$ – Daniel de Paula Dec 17 '16 at 18:56

1 Answer


The minEdgePartitions parameter used by fromEdgeTuples is passed on to the builder of its edge RDD. To get the same result (partitioned edges) with Graph.apply, first build a partitioned edge RDD, then pass it to Graph.apply.

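val parts = 100
// textFile yields RDD[String]; minPartitions sets a lower bound on the input partitions
val edgesRDD: RDD[Edge[Double]] = sc.textFile("/path/to/file", minPartitions = parts)
  .map { line =>
    // assumption: each line is "srcId dstId weight", whitespace-separated
    val Array(src, dst, weight) = line.trim.split("\\s+")
    Edge(src.toLong, dst.toLong, weight.toDouble)
  }
val verticesRDD : RDD[(VertexId, Long)] // built as in the question
val my_graph: Graph[Long, Double] = Graph.apply(verticesRDD, edgesRDD)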
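If the edge RDD already exists and cannot easily be re-read with more partitions, another option is the two-argument overload of partitionBy, which takes an explicit partition count. The following is only a sketch under that assumption: buildPartitionedGraph is a hypothetical helper name, not part of GraphX, and EdgePartition2D is simply the strategy suggested in the comments on the question.

```scala
import org.apache.spark.graphx.{Edge, Graph, PartitionStrategy, VertexId}
import org.apache.spark.rdd.RDD

// Hypothetical helper (not part of GraphX): builds the graph with Graph.apply
// and then redistributes its edges over `parts` partitions.
def buildPartitionedGraph(
    verticesRDD: RDD[(VertexId, Long)],
    edgesRDD: RDD[Edge[Double]],
    parts: Int): Graph[Long, Double] = {
  val graph = Graph(verticesRDD, edgesRDD)
  // The one-argument partitionBy keeps the existing number of edge partitions;
  // this overload shuffles the edges into `parts` partitions instead.
  graph.partitionBy(PartitionStrategy.EdgePartition2D, parts)
}
```

Calling buildPartitionedGraph(verticesRDD, edgesRDD, 100).edges.getNumPartitions afterwards is one way to check how many partitions the edges actually ended up in.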
PhiloJunkie