
I am using a Mesos cluster to deploy a Spark job (client mode). I have three servers and was able to run the Spark job. However, after a while (a few days), I got this error:

15/11/03 19:55:50 ERROR Executor: Managed memory leak detected; size = 33554432 bytes, TID = 387939
15/11/03 19:55:50 ERROR Executor: Exception in task 2.1 in stage 6534.0 (TID 387939)
java.io.FileNotFoundException: /tmp/blockmgr-3acec504-4a55-4aa8-a3e5-dda97ce5d055/03/temp_shuffle_cb37f147-c055-4014-a6ae-fd505cb49f57 (Too many open files)
    at java.io.FileOutputStream.open(Native Method)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
    at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:88)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:110)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Now it causes all of the streaming batches to be queued up and shown as "processing" (4042/streaming/). None of them is able to proceed until I manually restart the Spark job and resubmit it.

My Spark job just reads data from Kafka and does some updates into Mongo (quite a lot of update queries go through, but I configured the Spark streaming batch duration to about 5 minutes, so it shouldn't cause a problem).
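For reference, the job is structured roughly like the sketch below (simplified, not the actual code; the broker addresses and the Mongo update helper are placeholders, and the topic name is taken from the error above):

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object StreamingJobSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("kafka-to-mongo")
        // Batch duration of ~5 minutes, as described above.
        val ssc = new StreamingContext(conf, Seconds(300))

        // Placeholder broker list.
        val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
        val topics = Set("bid_inventory")

        // Direct Kafka stream (the KafkaRDD in the stack trace comes from this API).
        val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, topics)

        stream.foreachRDD { rdd =>
          rdd.foreachPartition { records =>
            // Placeholder: open a Mongo connection per partition and issue update queries.
            records.foreach { case (_, value) =>
              // updateMongo(value) // hypothetical helper doing the actual update
            }
          }
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }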

After a while, because no job is able to succeed, the Spark-Kafka reader starts to show this error:

ERROR Executor: Exception in task 5.3 in stage 7561.0 (TID 392220)
org.apache.spark.SparkException: Couldn't connect to leader for topic bid_inventory 9: java.nio.channels.ClosedChannelException
java.nio.channels.ClosedChannelException
java.nio.channels.ClosedChannelException
    at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator$$anonfun$connectLeader$1.apply(KafkaRDD.scala:164)
    at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator$$anonfun$connectLeader$1.apply(KafkaRDD.scala:164)    

But once restarted, everything starts working fine again.

Does anyone have an idea why this is happening? Thanks.

auxdx

  • I get this memory leak error as well, working with dataframes in Spark. I have no idea how to troubleshoot since it provides no information about where or why the leak is occurring. – Paul Dec 23 '15 at 00:01
  • It's a Spark bug: [SPARK-11293](https://issues.apache.org/jira/browse/SPARK-11293) – wdz Mar 23 '16 at 03:51
  • 2
    Intended for Spark devs i.e. we're not supposed to see this. See http://stackoverflow.com/questions/34359211/debugging-managed-memory-leak-detected-in-spark-1-6-0 – Martin Tapp Oct 19 '16 at 12:20

0 Answers