
I am executing the Pregel algorithm with Spark GraphX in Scala.

My graph contains 1 million nodes and 5 million edges between them. My cluster is very powerful, with several big-data servers with 256 GB of memory each.

I get a "Java heap space" error during a shuffle phase, after more than 20 minutes of processing (tasks are lost). I am going to analyse the following:

  • Analysis of the way I load the graph and of its persistence (StorageLevel)
  • Analysis of the memory used by nodes, edges, and the messages sent
  • Analysis of the parameters passed to spark-submit: number of executors, memory/vcores per executor, serialization...
Sander de Jong
Carlos AG

1 Answer


The conclusions of my analysis and the optimizations I applied:

  • Spark parameters passed to spark-submit: 90% of the YARN memory available in the cluster. I used 3 vcores per executor and 3 executors per physical server. I run with KryoSerializer to reduce the size of the serialized data.
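
As a sketch, these settings correspond to a SparkConf like the one below. The app name and the 75 GB figure are my own assumptions, not values from the run: with 256 GB per server and 3 executors each, roughly 75 GB per executor (plus overhead) stays close to 90% of the YARN memory.

```scala
import org.apache.spark.SparkConf

// Sketch of the configuration described above; memory value is an
// assumption derived from 256 GB/server and 3 executors per server.
val conf = new SparkConf()
  .setAppName("pregel-graphx")
  .set("spark.executor.cores", "3")
  .set("spark.executor.memory", "75g")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// Executors: 3 per physical server, so
// spark.executor.instances = 3 * <number of servers>.
```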

  • Graph - RDDs of nodes and edges: beforehand, I created the RDDs of nodes and edges and stored them in HDFS, split into 1000 files using coalesce, so that the data is distributed uniformly, although this takes a long time.

  • Graph - Loading: from existing RDD files in HDFS.
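
A sketch of this save-then-load scheme, under stated assumptions: the HDFS paths, the 5-Int vertex attribute type, and the MEMORY_AND_DISK_SER storage level are illustrative, and the code needs a running Spark cluster to execute.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Illustrative vertex attribute: 5 Int fields, no auxiliary data.
type VAttr = (Int, Int, Int, Int, Int)

def saveGraph(vertices: RDD[(Long, VAttr)], edges: RDD[Edge[Int]]): Unit = {
  // coalesce to 1000 files so the data is spread uniformly in HDFS
  vertices.coalesce(1000).saveAsObjectFile("hdfs:///graph/vertices")
  edges.coalesce(1000).saveAsObjectFile("hdfs:///graph/edges")
}

def loadGraph(sc: SparkContext): Graph[VAttr, Int] = {
  val vertices = sc.objectFile[(Long, VAttr)]("hdfs:///graph/vertices")
  val edges    = sc.objectFile[Edge[Int]]("hdfs:///graph/edges")
  Graph(vertices, edges,
    edgeStorageLevel   = StorageLevel.MEMORY_AND_DISK_SER,
    vertexStorageLevel = StorageLevel.MEMORY_AND_DISK_SER)
}
```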

  • Graph - Nodes and Edges: loaded correctly. Their Scala attributes are only the ones I actually use (5 attributes each), stored with minimal memory (Integers) and no auxiliary attributes.

  • Graph - Messages merged in the mergeMsg method: I combine the two messages using my own formula (specific to the goal of my project).
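
The real formula is project-specific; as an illustrative stand-in, a mergeMsg that keeps the distinct, order-independent union of two lists of known-good info (modelled here as Ints) could look like:

```scala
// Hypothetical stand-in for the project-specific merge formula:
// keep the distinct union of both lists, sorted so the result does
// not depend on the order in which the messages are combined.
type Msg = List[Int]

def mergeMsg(a: Msg, b: Msg): Msg =
  (a ++ b).distinct.sorted
```

For example, mergeMsg(List(3, 1), List(2, 3)) and mergeMsg(List(2, 3), List(3, 1)) both yield List(1, 2, 3), which is the commutativity Pregel expects of mergeMsg.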

  • Graph - vprog: each node gathers all the information received in the messages and saves it in a List of "known good info" inside the node. That info is used to create the messages sent in the next superstep.

  • Graph - Messages sent in the sendMsg method: each node uses its info (the List of known good info and its other attributes) to create the messages to be sent. Also, to reduce the number of messages, I filter out the messages that are not useful, so they are never included in the returned Iterator.
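
The filtering idea can be sketched with a plain helper; the "usefulness" test used here (drop info the target already knows, drop empty messages) is a hypothetical stand-in for the project's own criterion:

```scala
type Msg = List[Int]

// Build candidate messages, strip info the destination already knows,
// and drop empty messages entirely so they are never sent.
def usefulMessages(candidates: Seq[(Long, Msg)],
                   knownByTarget: Long => Set[Int]): Iterator[(Long, Msg)] =
  candidates.iterator
    .map { case (dst, msg) => (dst, msg.filterNot(knownByTarget(dst))) }
    .filter { case (_, msg) => msg.nonEmpty }
```

A message whose entire content is already known to its target is dropped before the Iterator is returned, which is what keeps the shuffle volume down.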

I discovered MY MAIN PROBLEM: the List inside each node that saves the "known good info" is immutable.

SOLUTION: I should use a ListBuffer (mutable) instead. Also, I should use the .append() method rather than .++(), because .++() creates a new instance of the List on every call.
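
The difference in a minimal sketch:

```scala
import scala.collection.mutable.ListBuffer

// Appending via ++ on an immutable List copies the list on every call,
// so accumulating n items costs O(n^2) time and garbage.
var slow = List.empty[Int]
for (i <- 1 to 5) slow = slow ++ List(i)   // new List instance each time

// ListBuffer appends in constant time without copying.
val fast = ListBuffer.empty[Int]
for (i <- 1 to 5) fast.append(i)           // or fast += i

// Both end up as List(1, 2, 3, 4, 5); only ListBuffer scales linearly.
```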

More info on Scala collections: http://docs.scala-lang.org/overviews/collections/performance-characteristics

Performance is now more than 10 times faster, and the memory errors no longer appear.

Carlos AG