I'm running a standalone application using Apache Spark, and when I load all my data into an RDD as a text file I get the following error:

15/02/27 20:34:40 ERROR Utils: Uncaught exception in thread stdout writer for python
java.lang.OutOfMemoryError: Java heap space
   at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
   at java.nio.ByteBuffer.allocate(ByteBuffer.java:331)
   at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFSInputStream.<init>(GoogleHadoopFSInputStream.java:81)
   at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.open(GoogleHadoopFileSystemBase.java:764)
   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427)
   at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:78)
   at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:51)
   at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:233)
   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:210)
   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:99)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
   at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242)
   at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
   at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
   at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
Exception in thread "stdout writer for python" java.lang.OutOfMemoryError: Java heap space
   at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
   at java.nio.ByteBuffer.allocate(ByteBuffer.java:331)
   at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFSInputStream.<init>(GoogleHadoopFSInputStream.java:81)
   at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.open(GoogleHadoopFileSystemBase.java:764)
   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427)
   at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:78)
   at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:51)
   at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:233)
   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:210)
   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:99)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
   at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242)
   at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
   at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
   at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)

I thought this was related to the fact that I'm caching the whole RDD in memory with the cache function, but I haven't noticed any change after removing that call from my code, so I keep getting this error.

My RDD is derived from several text files inside a directory located in a Google Cloud Storage bucket.
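The loading code is essentially the following minimal sketch (the master URL, app name, and bucket/directory names are placeholders):

from pyspark import SparkContext

sc = SparkContext("spark://<master>:7077", "my_app")  # placeholder master URL and app name

# every text file under the directory is read into a single RDD of lines
data = sc.textFile("gs://my-bucket/my-directory/")
data.cache()  # removing this call didn't make any difference

print(data.count())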

Could you help me solve this error?

Saulo Ricci
  • Did you manually create the cluster, or did you use something like [bdutil](https://github.com/GoogleCloudPlatform/bdutil)? What version of Spark are you using? – Dennis Huo Feb 27 '15 at 21:05
  • @DennisHuo I used bdutil to create the cluster. I'm using spark 1.2.0 – Saulo Ricci Feb 27 '15 at 22:17
  • What machine-type are you using? It would also help to peek at the contents of your `/home/hadoop/spark-install/conf/spark-env.sh` and `/home/hadoop/spark-install/conf/spark-defaults.conf` files. – Dennis Huo Feb 28 '15 at 00:12
  • @DennisHuo In my case I only have `/home/hadoop/hadoop-install` on my master node; I checked and there is no `/home/hadoop/spark-install` subdir. I do have the file `/home/hadoop/hadoop-install/conf/hadoop-env.sh`, though. Does that one work for you? – Saulo Ricci Feb 28 '15 at 00:25
  • @DennisHuo my master machine and my worker machines are the following type: n1-standard-4 (4 vCPU, 15 GB memory) – Saulo Ricci Feb 28 '15 at 00:28
  • Ah, how did you install Spark then? Did you just download it yourself? – Dennis Huo Feb 28 '15 at 00:37
  • Yep, basically I just deployed the instances through `bdutil deploy`, and after downloading Spark to my home dir I ran the following commands on my master instance: `export SPARK_HOME=~/spark-1.2.0/` `cp /home/hadoop/hadoop-install/conf/core-site.xml $SPARK_HOME/conf/` `cd $SPARK_HOME/lib_managed/jars` `wget https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar` `export SPARK_CLASSPATH=$SPARK_HOME/lib_managed/jars/gcs-connector-latest-hadoop2.jar` – Saulo Ricci Feb 28 '15 at 00:45
  • @DennisHuo I just added my answer to your last suggestion and scenario. Would you mind taking a look? Thank you very much for your interest and willingness to help. – Saulo Ricci Feb 28 '15 at 04:35

1 Answer

Spark requires a fair bit of configuration tuning depending on cluster size, shape, and workload; out of the box it probably won't work for realistically sized workloads.

When using bdutil to deploy, the best way to get Spark is actually to use the officially supported bdutil plugin, simply with:

./bdutil -e extensions/spark/spark_env.sh deploy

Or equivalently as shorthand:

./bdutil -e spark deploy

This will make sure the gcs-connector and memory settings, etc., are all properly configured in Spark.
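Once deployed this way you shouldn't need to add the gcs-connector jar or set SPARK_CLASSPATH by hand. As a quick smoke test from the pyspark shell on the master (bucket and path are placeholders; sc is the shell's built-in SparkContext):

# read a small sample from GCS to confirm the connector is wired up
lines = sc.textFile("gs://your-bucket/some-directory/")
print(lines.take(5))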

You can also theoretically use bdutil to install Spark directly on your existing cluster, though this is less thoroughly tested:

# After you've already deployed the cluster with ./bdutil deploy:
./bdutil -e spark run_command_group install_spark -t all
./bdutil -e spark run_command_group spark_configure_startup -t all
./bdutil -e spark run_command_group start_spark -t master

This should be the same as if you had originally run `./bdutil -e spark deploy`. If you deployed with `./bdutil -e my_custom_env.sh deploy`, then all of the above commands actually need to start with `./bdutil -e my_custom_env.sh -e spark run_command_group`.

In your case, the relevant Spark memory settings were probably `spark.executor.memory` and/or `SPARK_WORKER_MEMORY` and/or `SPARK_DAEMON_MEMORY`.
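If you want to experiment before redeploying, `spark.executor.memory` can also be set per application from PySpark; a rough sketch (the app name and the 4g value are just illustrations for n1-standard-4 workers, not recommendations):

from pyspark import SparkConf, SparkContext

# illustrative values only -- tune for your 15 GB n1-standard-4 workers
conf = (SparkConf()
        .setAppName("gcs-textfile-job")        # hypothetical app name
        .set("spark.executor.memory", "4g"))   # per-executor JVM heap

sc = SparkContext(conf=conf)

`SPARK_WORKER_MEMORY` and `SPARK_DAEMON_MEMORY`, by contrast, are environment variables that belong in spark-env.sh on the cluster nodes.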

EDIT: On a related note, we just released bdutil-1.2.0, which defaults to Spark 1.2.1 and also adds improved Spark driver memory settings and YARN support.

Dennis Huo