In GraphX, how to partition a graph with a custom PartitionStrategy that makes use of its topology?

Question

I want to add a new PartitionStrategy that makes use of graph topology information. However, PartitionStrategy only has the following function, and I cannot find any function that receives the graph data.

  override def getPartition(src: VertexId, dst: VertexId, numParts: PartitionID): PartitionID = {
    println("partitioning!")
    numParts
  }

This function only receives one piece of src-dst information at a time.

In the Spark GraphX source, org.apache.spark.graphx.impl.GraphImpl, I found the following code:

  override def partitionBy(
      partitionStrategy: PartitionStrategy, numPartitions: Int): Graph[VD, ED] = {
    val edTag = classTag[ED]
    val vdTag = classTag[VD]
    val newEdges = edges.withPartitionsRDD(edges.map { e =>
      val part: PartitionID = partitionStrategy.getPartition(e.srcId, e.dstId, numPartitions)
      (part, (e.srcId, e.dstId, e.attr))
    }
      .partitionBy(new HashPartitioner(numPartitions))
      .mapPartitionsWithIndex(
        { (pid: Int, iter: Iterator[(PartitionID, (VertexId, VertexId, ED))]) =>
          val builder = new EdgePartitionBuilder[ED, VD]()(edTag, vdTag)
          iter.foreach { message =>
            val data = message._2
            builder.add(data._1, data._2, data._3)
          }
          val edgePartition = builder.toEdgePartition
          Iterator((pid, edgePartition))
        }, preservesPartitioning = true)).cache()
    GraphImpl.fromExistingRDDs(vertices.withEdges(newEdges), newEdges)
  }

The call .partitionBy(new HashPartitioner(numPartitions)) comes from the PairRDDFunctions class:

  /**
   * Return a copy of the RDD partitioned using the specified partitioner.
   */
  def partitionBy(partitioner: Partitioner): RDD[(K, V)] = self.withScope {
    if (keyClass.isArray && partitioner.isInstanceOf[HashPartitioner]) {
      throw new SparkException("HashPartitioner cannot partition array keys.")
    }
    if (self.partitioner == Some(partitioner)) {
      self
    } else {
      new ShuffledRDD[K, V, V](self, partitioner)
    }
  }

HashPartitioner is defined as follows:

/**
 * A [[org.apache.spark.Partitioner]] that implements hash-based partitioning using
 * Java's `Object.hashCode`.
 *
 * Java arrays have hashCodes that are based on the arrays' identities rather than their contents,
 * so attempting to partition an RDD[Array[_]] or RDD[(Array[_], _)] using a HashPartitioner will
 * produce an unexpected or incorrect result.
 */
class HashPartitioner(partitions: Int) extends Partitioner {
  require(partitions >= 0, s"Number of partitions ($partitions) cannot be negative.")

  def numPartitions: Int = partitions

  def getPartition(key: Any): Int = key match {
    case null => 0
    case _ => Utils.nonNegativeMod(key.hashCode, numPartitions)
  }

  override def equals(other: Any): Boolean = other match {
    case h: HashPartitioner =>
      h.numPartitions == numPartitions
    case _ =>
      false
  }

  override def hashCode: Int = numPartitions
}

But none of these functions can access the graph data.
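To make the chain above concrete, here is a minimal sketch of how the ID returned by getPartition becomes a physical partition. It assumes only what the quoted sources show: the returned ID is used as the shuffle key, and HashPartitioner applies a non-negative modulo to the key's hashCode. `ShuffleKeyDemo` and `physicalPartition` are hypothetical names, and `nonNegativeMod` is re-implemented locally instead of calling Spark's internal Utils.

```scala
object ShuffleKeyDemo {
  // Local re-implementation of the non-negative modulo used by HashPartitioner
  // (an assumption-free stand-in for Spark's internal Utils.nonNegativeMod).
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val rawMod = x % mod
    rawMod + (if (rawMod < 0) mod else 0)
  }

  // Physical partition an edge lands in, given the ID returned by getPartition.
  // The hashCode of an Int key is the Int itself.
  def physicalPartition(returnedId: Int, numPartitions: Int): Int =
    nonNegativeMod(returnedId.hashCode, numPartitions)

  def main(args: Array[String]): Unit = {
    val numPartitions = 4
    for (id <- Seq(0, 3, 4, -1))
      println(s"getPartition returned $id -> partition ${physicalPartition(id, numPartitions)}")
  }
}
```

Under these assumptions, the placement of an edge depends only on the value computed from its own (src, dst) pair, and an out-of-range value such as numParts itself wraps around to partition 0.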

I read the code of PowerGraph (distributed_constrained_random_ingress.hpp) and PowerLyra (distributed_hybrid_ingress.hpp). In the preprocessing phase they have access to the graph, so graph topology information can be used.

I want to make use of graph topology information, but I don't know how to add a new function in Spark that receives the graph data and then gives every edge a new PartitionID.

I know one solution is to add a new function in `org.apache.spark.graphx.impl.GraphImpl` and override the `partitionBy` function. That way I can get the graph data without affecting other code, and Spark users can choose that function to partition graph data when they write graph code. The disadvantage is that it does not implement PartitionStrategy, which makes the Spark code less orderly. - DrowFish19, Nov 19 '19 at 08:38

bonnal-enzo · Answer 1 · 2022-05-29T23:30:19.360

Here is an approach:

1. Collect the minimal necessary information from your graph.
2. Instantiate a PartitionStrategy that captures this information.

As a dummy example, the following snippet partitions a graph with this rule: if an edge's destination is also a source somewhere in the graph, the edge is assigned to partition 0; otherwise it is assigned to partition 1.

val graph: Graph[_, _] = [...]

graph.partitionBy(
  new PartitionStrategy {
    // select distinct sources only
    val capturedGraphData: Set[Long] = graph
      .edges
      .map(e => e.srcId)
      .distinct()
      .collect()
      .toSet

    override def getPartition(src: VertexId, dst: VertexId, numParts: PartitionID): PartitionID = {
      if(capturedGraphData.contains(dst)) 0
      else 1
    }
  }
)
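
The same idea can also be written with a named class, which makes explicit that the strategy carries only the collected set and not the graph. This is a minimal sketch, assuming a local SparkContext, a tiny hand-built graph, and at least two partitions; `SourceAwareStrategy` and `SourceAwareDemo` are hypothetical names, not part of GraphX.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, PartitionID, PartitionStrategy, VertexId}

// Hypothetical strategy: holds only the precomputed topology summary.
class SourceAwareStrategy(sources: Set[VertexId]) extends PartitionStrategy {
  override def getPartition(src: VertexId, dst: VertexId, numParts: PartitionID): PartitionID =
    if (sources.contains(dst)) 0 else 1 // same dummy rule as above
}

object SourceAwareDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("topology-partitioning")
    val sc = new SparkContext(conf)

    // Small example graph: 1 -> 2 -> 3 -> 4 (edge attribute is unused).
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 4L, 1)))
    val graph = Graph.fromEdges(edges, defaultValue = 0)

    // Step 1: collect the minimal topology information on the driver.
    val sources: Set[VertexId] = graph.edges.map(_.srcId).distinct().collect().toSet

    // Step 2: hand it to the strategy and repartition into two partitions.
    val partitioned = graph.partitionBy(new SourceAwareStrategy(sources), 2)

    // Print which partition each edge ended up in.
    partitioned.edges
      .mapPartitionsWithIndex((pid, it) => it.map(e => (pid, e.srcId, e.dstId)))
      .collect()
      .foreach(println)

    sc.stop()
  }
}
```

If the set is large but still fits in memory, the class could hold a `Broadcast[Set[VertexId]]` created with `sc.broadcast` instead of the raw set; that is a design option, not something the snippet above does.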

Note about scalability: if the use case requires a capturedGraphData that is too large, the driver and the executors will run short of memory, because the data is collected on the driver and then shipped to every executor. This is why it is important to select only the minimal necessary information from the graph.

Indeed, you are right @AaronZolnai-Lucas: the memory of the driver and executors will suffer if the use case requires a `capturedGraphData` that is huge. I edited the answer to add the warning. - bonnal-enzo, May 29 '22 at 23:22
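
When the topology information is too large to collect, one possible alternative is to skip PartitionStrategy and place the edges yourself with ordinary RDD operations. The sketch below is an assumption-laden illustration, not the answer's method: it joins vertex degrees onto the edges, keys each edge by its lower-degree endpoint (a simple degree-based heuristic chosen only for illustration), shuffles with a HashPartitioner, and rebuilds the graph. It relies on GraphX building its edge partitions from the partitions of the input edge RDD, which is worth verifying for your Spark version. `DegreeBasedPartitioning`, `assign` and `repartition` are hypothetical names.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.graphx.{Edge, Graph, PartitionID, VertexId}
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

object DegreeBasedPartitioning {
  // Hypothetical rule: key each edge by its lower-degree endpoint.
  def assign(src: VertexId, srcDeg: Int, dst: VertexId, dstDeg: Int,
             numParts: Int): PartitionID = {
    val anchor = if (srcDeg <= dstDeg) src else dst
    // Non-negative modulo, so negative vertex IDs still map into [0, numParts).
    (((anchor % numParts) + numParts) % numParts).toInt
  }

  def repartition[VD: ClassTag, ED: ClassTag](
      graph: Graph[VD, ED], numParts: Int): Graph[VD, ED] = {
    // Topology summary computed and joined in a distributed way (no collect).
    val degrees: RDD[(VertexId, Int)] = graph.degrees
    val keyed: RDD[(PartitionID, Edge[ED])] = graph.edges
      .map(e => (e.srcId, e))
      .join(degrees) // attach the source degree
      .map { case (_, (e, srcDeg)) => (e.dstId, (e, srcDeg)) }
      .join(degrees) // attach the destination degree
      .map { case (_, ((e, srcDeg), dstDeg)) =>
        (assign(e.srcId, srcDeg, e.dstId, dstDeg, numParts), e)
      }
    // Shuffle edges to the chosen partitions, then rebuild the graph from them.
    val placedEdges = keyed.partitionBy(new HashPartitioner(numParts)).values
    Graph(graph.vertices, placedEdges)
  }
}
```

A call would look like `DegreeBasedPartitioning.repartition(graph, 8)`. The two joins add shuffles of their own, so this trades driver memory for extra network traffic.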

score 0 · Answer 2 · answered Jan 09 '23 at 08:48

I use this method (Java):

PartitionStrategy newPS = new PartitionStrategy() {
    @Override
    public int getPartition(long src, long dst, int numParts) {
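        // Java port of the Scala original:
        //      val mixingPrime: VertexId = 1125899906842597L
        //      (math.abs(src * mixingPrime) % numParts).toInt
        long mixingPrime = 1125899906842597L;
        // Take the modulo on the long value first, then narrow to int;
        // casting first would truncate the long and could yield a negative ID.
        return (int) (Math.abs(src * mixingPrime) % numParts);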
    }
};
graph.partitionBy(newPS);
