
I am running a Spark cluster with 80 machines. Each machine is a VM with 8 cores and 50 GB of memory (41 GB seems to be available to Spark).

I am running over several input folders; I estimate the total input size to be ~250 GB, gzip compressed.

I get errors in the driver log that I do not know what to make of. Examples (in the order they appear in the logs):

240884 [Result resolver thread-0] WARN org.apache.spark.scheduler.TaskSetManager  - Lost task 445.0 in stage 1.0 (TID 445, hadoop-w-59.c.taboola-qa-01.internal): java.net.SocketTimeoutException: Read timed out
        java.net.SocketInputStream.socketRead0(Native Method)
        java.net.SocketInputStream.read(SocketInputStream.java:152)
        java.net.SocketInputStream.read(SocketInputStream.java:122)
        java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
        java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
        java.io.BufferedInputStream.read(BufferedInputStream.java:334)
        sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:687)
        sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:633)
        sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1323)
        org.apache.spark.util.Utils$.fetchFile(Utils.scala:376)
        org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:325)
        org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:323)
        scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
        scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
        scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
        scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
        scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
        scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
        scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
        org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:323)
        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:158)
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        java.lang.Thread.run(Thread.java:745)


271722 [Result resolver thread-3] WARN org.apache.spark.scheduler.TaskSetManager  - Lost task 247.0 in stage 2.0 (TID 883, hadoop-w-79.c.taboola-qa-01.internal): java.lang.NullPointerException: 
            org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:153)
            org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
            org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
            org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
            org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
            org.apache.spark.scheduler.Task.run(Task.scala:54)
            org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
            java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
            java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
            java.lang.Thread.run(Thread.java:745)


309052 [Result resolver thread-1] WARN org.apache.spark.scheduler.TaskSetManager  - Lost task 272.0 in stage 2.0 (TID 908, hadoop-w-58.c.taboola-qa-01.internal): java.io.IOException: unexpected exception type
        java.io.ObjectStreamClass.throwMiscException(ObjectStreamClass.java:1538)
        java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1025)
        java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
        java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
        java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
        java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
        java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
        java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
        org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
        org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:159)
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        java.lang.Thread.run(Thread.java:745)


820940 [connection-manager-thread] INFO org.apache.spark.network.ConnectionManager  - key already cancelled ? sun.nio.ch.SelectionKeyImpl@1c827563
java.nio.channels.CancelledKeyException
    at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:386)
    at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139)

Since my job class (Phase0) does not appear in any of the stack traces, I am not sure what these errors tell me about the source of the problem. Any suggestions?

EDIT: Specifically, the following exception occurs even when I run on a folder of only a few GB:

271722 [Result resolver thread-3] WARN org.apache.spark.scheduler.TaskSetManager  - Lost task 247.0 in stage 2.0 (TID 883, hadoop-w-79.c.taboola-qa-01.internal): java.lang.NullPointerException: 
            org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:153)
            org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
            org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
            org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
            org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
            org.apache.spark.scheduler.Task.run(Task.scala:54)
            org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
            java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
            java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
            java.lang.Thread.run(Thread.java:745)

1 Answer

The solution is not specific to the exceptions mentioned here, but eventually I was able to overcome all issues in Spark using the following guidelines:

  1. All machines should be tuned in terms of ulimit and process memory, as follows:

Add the following to /etc/security/limits.conf:

hadoop soft nofile 900000
root soft nofile 900000
hadoop hard nofile 990000
root hard nofile 990000
hadoop hard memlock unlimited
root hard memlock unlimited
hadoop soft memlock unlimited
root soft memlock unlimited

And add the following line to /etc/pam.d/common-session and /etc/pam.d/common-session-noninteractive:

"session required pam_limits.so"
  2. Core usage: if using a VM, I would recommend allocating n-1 cores to Spark and leaving 1 core for communication and other tasks.
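
For example, a minimal sketch in Scala (spark.executor.cores is honored by YARN, and only by later standalone releases; the app name is just the job class from the question, the rest is illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    // On an 8-core VM, hand Spark n-1 = 7 cores per executor and
    // leave the remaining core for the OS and network threads.
    val conf = new SparkConf()
      .setAppName("Phase0")                 // job class named in the question
      .set("spark.executor.cores", "7")     // n-1 of the VM's 8 cores
    val sc = new SparkContext(conf)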

  3. Partitions: I would recommend using a number of partitions between 5x and 10x the number of cores used in the cluster. If you see out-of-memory errors, increase the number of partitions, first by increasing the factor and then by adding machines to the cluster.
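
As a rough sketch of the arithmetic (the factor of 8 and the input path are illustrative; with 80 machines x 7 usable cores there are 560 cores, so 5x-10x means roughly 2,800-5,600 partitions):

    // 560 cores x factor 8 = 4,480 partitions, inside the 5x-10x band.
    val numPartitions = 80 * 7 * 8

    // Caveat: .gz files are not splittable, so the minPartitions hint
    // cannot break a single gzip file apart; repartition after reading.
    val lines = sc.textFile("hdfs:///input/folder-*", numPartitions)
      .repartition(numPartitions)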

  4. Output data per key: if you see errors such as "Array size exceeds VM limit", you probably have too much data per key, so you need to decrease the amount of data per key. For example, if you output files per 1-hour interval, try decreasing to a 10-minute, or even a 1-minute, interval.
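
A sketch of re-keying to a finer time bucket (the Event class and the events RDD are hypothetical; the question does not show the job's schema):

    // Hypothetical record type standing in for the job's real data.
    case class Event(timestampMs: Long, payload: String)

    // events: RDD[Event] produced earlier in the job.
    val bucketMs = 10 * 60 * 1000L          // 10-minute buckets instead of 1 hour
    val byBucket = events
      .map(e => (e.timestampMs / bucketMs, e.payload))  // finer keys, smaller groups
      .groupByKey()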

  5. If you still see errors, look for them in the Spark bug reports; you may want to make sure you have upgraded to the latest Spark version. For me, version 1.2 fixed a bug that had been causing my job to fail.

  6. Use a Kryo registrator: move each piece of your RDD transformation logic into its own class, and make sure to register all of those classes with Kryo.
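
A minimal registrator sketch (the class and package names are placeholders for your own code; Event is the hypothetical record class from the earlier sketch):

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.serializer.KryoRegistrator

    // Register every class that flows through a shuffle or broadcast.
    class Phase0Registrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit = {
        kryo.register(classOf[Event])       // hypothetical record class
      }
    }

Then wire it into the job's configuration:

    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    conf.set("spark.kryo.registrator", "com.example.Phase0Registrator")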
