The conclusions of my analysis and the optimizations I used:
Spark parameters used in spark-submit: 90% of the available YARN memory in the cluster, 3 vcores per executor, and 3 executors per physical server. I ran it with KryoSerializer to reduce the size of the stored data. An illustrative invocation follows.
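As a rough sketch only: the concrete figures below (10 servers, 30 executors, 18g per executor, the jar name) are made-up placeholders and would have to be re-derived so that about 90% of your cluster's YARN memory is used:

    # Assumed cluster: 10 physical servers, 3 executors each.
    spark-submit \
      --master yarn \
      --num-executors 30 \
      --executor-cores 3 \
      --executor-memory 18g \
      --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
      my-graphx-job.jar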
Graph - RDDs of nodes and edges: beforehand, I created the RDDs of nodes and edges and stored them in HDFS in 1000 files using coalesce, so that the data is saved uniformly, although this takes a long time.
Graph - Loading: from the existing RDD files in HDFS, as in the sketch below.
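A minimal sketch of this save-then-load pattern (the paths are placeholders, and VertexAttr / EdgeAttr are the hypothetical attribute classes sketched under the next item):

    import org.apache.spark.graphx.{Edge, Graph, VertexId}

    // Save once: coalesce into 1000 HDFS files (slow, but done only once).
    vertexRDD.coalesce(1000).saveAsObjectFile("hdfs:///graph/vertices")
    edgeRDD.coalesce(1000).saveAsObjectFile("hdfs:///graph/edges")

    // Load: rebuild the graph directly from the stored RDD files.
    val vertices = sc.objectFile[(VertexId, VertexAttr)]("hdfs:///graph/vertices")
    val edges    = sc.objectFile[Edge[EdgeAttr]]("hdfs:///graph/edges")
    val graph    = Graph(vertices, edges)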
Graph - Nodes and Edges: loaded correctly. Their Scala attributes are only the ones I actually use (5 attributes each), stored with minimal memory (Integers) and with no auxiliary attributes. For example:
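A minimal sketch (the field names are hypothetical; the point is compact Int fields and nothing auxiliary):

    import scala.collection.mutable.ListBuffer

    // Hypothetical attribute classes: only the fields that are actually used,
    // stored as Ints; the vertex also carries the "known good info" buffer
    // described in the items below.
    case class VertexAttr(a: Int, b: Int, c: Int, d: Int, known: ListBuffer[Int])
    case class EdgeAttr(p: Int, q: Int, r: Int, s: Int, t: Int)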
Graph - Messages merged in the mergeMsg method: I combine the 2 messages using my own formula (related to the goal of my project); see the combined Pregel sketch after the sendMsg item below.
Graph - vprog: the nodes gather all the information received in the messages and save it in a List of "known good info" inside each node. That info is then used to create the messages sent in sendMsg.
Graph - Messages sent in the sendMsg method: each node uses its info (the List of known good info plus its other attributes) to create the messages to be sent. Also, to reduce the number of messages, I filtered out the ones that were not useful, so they are never put in the Iterator.
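A minimal sketch of how the three Pregel methods fit together, reusing the hypothetical VertexAttr / EdgeAttr from above; the message type (List[Int]), the merge formula, and the filtering criterion are placeholders, not my real project logic:

    import org.apache.spark.graphx._

    // Assumes graph: Graph[VertexAttr, EdgeAttr], loaded as sketched earlier.
    val result = graph.pregel(List.empty[Int])(
      // vprog: gather everything received into the node's "known good info".
      vprog = (id: VertexId, attr: VertexAttr, msg: List[Int]) => {
        attr.known ++= msg               // in-place append, no new List instance
        attr
      },
      // sendMsg: each node builds its message from its own info; useless
      // messages are filtered out instead of being put in the Iterator.
      sendMsg = triplet => {
        val msg = triplet.srcAttr.known.toList
        if (msg.nonEmpty) Iterator((triplet.dstId, msg)) else Iterator.empty
      },
      // mergeMsg: combine the two incoming messages (placeholder:
      // concatenation; the project-specific formula goes here).
      mergeMsg = (m1: List[Int], m2: List[Int]) => m1 ++ m2
    )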
I discovered MY MAIN PROBLEM: the List inside each node that saves the "known good info" is immutable, so every update allocates a whole new List.
SOLUTION: I should use ListBuffer (mutable) instead. Also, I should use .append() instead of .++(), because the latter creates a new instance of the List. For example:
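A minimal illustration of the difference:

    import scala.collection.mutable.ListBuffer

    // Immutable List: ++ builds a brand-new List on every update.
    var known = List[Int]()
    known = known ++ List(1, 2)     // allocates a new instance each time

    // Mutable ListBuffer: append modifies the buffer in place.
    val buffer = ListBuffer[Int]()
    buffer.append(1)                // in place, amortized constant time
    buffer += 2                     // equivalent shorthand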
More info on the performance characteristics of Scala collections: http://docs.scala-lang.org/overviews/collections/performance-characteristics
Performance is now more than 10 times faster, and the memory errors no longer appear.