
Folks,

I am running PySpark code that reads a 500 MB file from HDFS and constructs a NumPy matrix from the contents of the file.

Cluster Info:

9 datanodes, each with 128 GB memory and 48 vCores

Job config

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName('test') \
        .set('spark.executor.cores', 4) \
        .set('spark.executor.memory', '72g') \
        .set('spark.driver.memory', '16g') \
        .set('spark.yarn.executor.memoryOverhead', 4096) \
        .set('spark.dynamicAllocation.enabled', 'true') \
        .set('spark.shuffle.service.enabled', 'true') \
        .set('spark.serializer', 'org.apache.spark.serializer.KryoSerializer') \
        .set('spark.driver.maxResultSize', 10000) \
        .set('spark.kryoserializer.buffer.max', 2044)

    sc = SparkContext(conf=conf)

    # Read the 500 MB file, cache it, and pull every split line back to the driver.
    fileRDD = sc.textFile("/tmp/test_file.txt")
    fileRDD.cache()
    list_of_lines_from_file = fileRDD.map(lambda line: line.split(" ")).collect()
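The collected lines are then turned into a NumPy matrix, roughly like this (a simplified sketch of that step, assuming every line holds the same number of space-separated numeric values):

    import numpy as np

    # Sketch of the matrix-construction step described above; assumes the file
    # contains rows of space-separated numbers that all have the same length.
    matrix = np.array(
        [[float(x) for x in row] for row in list_of_lines_from_file],
        dtype=np.float64
    )
    print(matrix.shape)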

Error

The collect() step throws an out-of-memory error:

18/05/17 19:03:15 ERROR client.TransportResponseHandler: Still have 1 
requests outstanding when connection fromHost/IP:53023 is closed
18/05/17 19:03:15 ERROR shuffle.OneForOneBlockFetcher: Failed while starting block fetches
java.lang.OutOfMemoryError: Java heap space

Any help is much appreciated.

1 Answer

A little background on this issue

I was having this issue while running the code through a Jupyter Notebook that runs on an edge node of a Hadoop cluster.

Finding in Jupyter

Since you can only submit code from Jupyter in client mode (equivalent to launching spark-shell from the edge node), the Spark driver always runs on the edge node, which is already packed with other long-running daemon processes. The memory available there was always less than what fileRDD.collect() needed for my file.
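To see this from the notebook itself, you can print where the driver is actually running (a quick illustrative check, assuming Spark 2.x; the exact values depend on your setup):

    # Run inside the Jupyter session: in client mode the driver host
    # resolves to the edge node the notebook lives on.
    print(sc.master)                                    # e.g. 'yarn'
    print(sc.getConf().get('spark.submit.deployMode'))  # 'client'
    print(sc.getConf().get('spark.driver.host'))        # edge-node hostname/IP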

Worked fine in spark-submit

I put the content from Jupyter into a .py file and invoked it through spark-submit with the same settings. Whoa!! It ran in seconds there. The reason: with --deploy-mode cluster, the driver is launched on one of the cluster nodes that has the required memory free, instead of on the crowded edge node.

    spark-submit --name "test_app" --master yarn --deploy-mode cluster \
        --conf spark.executor.cores=4 \
        --conf spark.executor.memory=72g \
        --conf spark.driver.memory=72g \
        --conf spark.yarn.executor.memoryOverhead=8192 \
        --conf spark.dynamicAllocation.enabled=true \
        --conf spark.shuffle.service.enabled=true \
        --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
        --conf spark.kryoserializer.buffer.max=2044 \
        --conf spark.driver.maxResultSize=1g \
        --conf spark.driver.extraJavaOptions='-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:MaxDirectMemorySize=2g' \
        --conf spark.executor.extraJavaOptions='-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:MaxDirectMemorySize=2g' \
        test.py

Next Step:

Our next step is to see whether the Jupyter notebook can submit the Spark job to the YARN cluster via a Livy job server or a similar approach.
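For reference, a batch submission through Livy's REST API could look roughly like the sketch below (the Livy host/port and the HDFS path of the script are placeholders, not something we have set up yet):

    import json
    import requests

    # Hypothetical Livy endpoint; adjust host/port for your cluster.
    livy_url = "http://livy-host:8998/batches"

    payload = {
        "file": "hdfs:///tmp/test.py",   # the same .py we ran with spark-submit
        "name": "test_app",
        "conf": {
            "spark.executor.cores": "4",
            "spark.executor.memory": "72g",
            "spark.driver.memory": "72g",
        },
    }

    # POST /batches asks Livy to spark-submit the job on the YARN cluster for us.
    resp = requests.post(livy_url,
                         data=json.dumps(payload),
                         headers={"Content-Type": "application/json"})
    print(resp.json())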
