
I am moving from MPI-based systems to Apache Spark. I need to do the following in Spark.

Suppose I have n vertices. I want to create an edge list from these n vertices. An edge is just a tuple of two integers (u, v); no attributes are required.

However, I want to create the edges in parallel, independently in each executor. That is, I want to create P edge arrays independently for P Spark executors. Each array may have a different size and depends on the vertices assigned to it, so I also need an executor ID running from 0 to P-1. Afterwards, I want a single global RDD of edges.

In MPI, I would create an array in each processor using the processor rank. How do I do that in Spark, especially using the GraphX library?

Therefore, my primary goal is to create an array of edges in each executor and combine them into a single RDD.

I am first trying a modified version of the Erdős–Rényi model. As parameters I only have the number of nodes n and a probability p.

Suppose executor i has to process nodes 101 to 200. For any node, say node 101, it will create an edge from 101 to each of the nodes 102 through n, each with probability p. After each executor creates its allocated edges, I would instantiate the GraphX EdgeRDD and VertexRDD. So my plan is to create the edge lists independently in each executor and merge them into a single RDD.
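In plain sequential Scala, the per-node rule I have in mind looks roughly like this (just a sketch with 1-based node IDs; the helper name is mine):

def edgesForNode(u: Long, n: Long, p: Double, rnd: scala.util.Random): Seq[(Long, Long)] = {
  // node u proposes an edge to every larger node v up to n, keeping each with probability p
  ((u + 1) to n).filter(_ => rnd.nextDouble() < p).map(v => (u, v))
}

so edgesForNode(101, n, p, rnd) would produce the edges owned by node 101. My question is how to run this per executor and merge the results.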

  • I don't know anything about MPI, but from your description I can tell that your way of thinking about your problem is much too "low level". In Spark, you don't worry about which executor is storing what array. Simply create your RDD and Spark will automatically handle distribution and processing of the data. I would also suggest that you go through the documentation of `GraphX` as Vertices and Edges need to be defined in a certain way to be used in `GraphX`. – Glennie Helles Sindholt Dec 16 '15 at 07:23
  • Thanks for the suggestions. I am trying to implement a graph generator in parallel. The generator has to create edges in such a way that the computational load will be well balanced. – max Dec 16 '15 at 15:26

1 Answer


Let's start with some imports and variables which will be required for downstream processing:

import org.apache.spark._
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
import scala.util.Random
import org.apache.spark.HashPartitioner

val nPartitions: Int = ???  // number of partitions P
val n: Long = ???           // number of nodes
val p: Double = ???         // edge probability

Next we'll need an RDD of seed IDs which can be used to generate edges. A naive way to handle this would be something like this:

sc.parallelize(0L until n)

Since the number of generated edges depends on the node ID, this approach would give a highly skewed load: low IDs generate far more candidate edges than high ones. We can do a little bit better with repartitioning:

sc.parallelize(0L until n)
  .map((_, None))
  .partitionBy(new HashPartitioner(nPartitions))
  .keys

but a much better approach is to start with an empty RDD and generate the IDs in place. We'll need a small helper:

def genNodeIds(nPartitions: Int, n: Long)(i: Int) = {
  // partition i gets ids i, i + nPartitions, i + 2 * nPartitions, ...
  (0L until n).filter(_ % nPartitions == i).toIterator
}

which can be used as follows:

val empty = sc.parallelize(Seq.empty[Int], nPartitions)
val ids = empty.mapPartitionsWithIndex((i, _) => genNodeIds(nPartitions, n)(i))
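
To see how the IDs are laid out, you can peek at the first few elements of each partition (debugging only, since glom materializes whole partitions):

ids.glom().map(_.take(3).toSeq).collect()
// partition i holds i, i + nPartitions, i + 2 * nPartitions, ...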

Just a quick sanity check (it is quite expensive so don't use it in production):

require(ids.distinct.count == n) 
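
Because each partition holds IDs spread evenly across the whole range, the expensive low IDs are shared out. A rough way to check the balance is to sum the number of candidate edges per partition (also expensive, debugging only):

val workPerPartition = ids
  .mapPartitions(iter => Iterator(iter.map(i => n - i - 1).sum))
  .collect()
// node i considers n - i - 1 candidates, so these sums should be similar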

Now we can generate the actual edges using another helper:

def genEdgesForId(p: Double, n: Long, random: Random)(i: Long) = {
  // node i proposes an edge to every j > i, keeping each with probability p
  (i + 1 until n).filter(_ => random.nextDouble < p).map(j => Edge(i, j, ()))
}

def genEdgesForPartition(iter: Iterator[Long]) = {
  // It could be overkill, but better safe than sorry.
  // Depending on your requirements it could be worth
  // considering commons-math:
  // https://commons.apache.org/proper/commons-math/userguide/random.html
  val random = new Random(new java.security.SecureRandom())
  iter.flatMap(genEdgesForId(p, n, random))
}

val edges = ids.mapPartitions(genEdgesForPartition)

Finally we can create a graph:

val graph = Graph.fromEdges(edges, ())
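
A couple of quick checks on the result (these trigger the full computation, so use them only while experimenting). Note that Graph.fromEdges creates vertices only from edge endpoints, so isolated nodes won't appear:

graph.numEdges     // expected value is roughly p * n * (n - 1) / 2
graph.numVertices  // at most n, since isolated nodes are dropped
graph.edges.take(5).foreach(println)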
  • Thanks very much, this is very thorough guide. This will also be very helpful for my understanding of Spark. – max Dec 18 '15 at 04:33
  • I've simplified this a little bit. It should also address the problem you've mentioned in the other question. – zero323 Dec 25 '15 at 08:24