
I'm using boto3 to read files from S3; this has proven to be much faster than sc.textFile(...). The files are roughly 300MB to 1GB each. The process goes like this:

data = sc.parallelize(list_of_files, numSlices=n_partitions) \
    .flatMap(read_from_s3_and_split_lines)

events = data.aggregateByKey(...)
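
For reference, read_from_s3_and_split_lines looks roughly like this (a simplified sketch with a placeholder bucket name, not the exact code):

import boto3

def read_from_s3_and_split_lines(key):
    # Each task downloads one object and yields its lines as records.
    # 'my-bucket' is a placeholder for the real bucket name.
    s3 = boto3.client('s3')
    body = s3.get_object(Bucket='my-bucket', Key=key)['Body']
    return body.read().decode('utf-8').splitlines()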

When running this process, I get the exception:

15/12/04 10:58:00 WARN TaskSetManager: Lost task 41.3 in stage 0.0 (TID 68, 10.83.25.233): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:203)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
    at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:342)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:139)
    ... 15 more

Many times only a few tasks crash and the job is able to recover. Sometimes, however, the whole job fails after a number of these errors. I haven't been able to find the origin of this problem; it seems to appear and disappear depending on the number of files I read, the exact transformations I apply, and so on. It never fails when reading a single file.

1 Answer


I have encountered a similar problem; my investigation showed that the cause was a lack of free memory for the Python process. Spark took all the memory, and the Python worker process (where PySpark runs) kept crashing.

Some advice:

  1. add some memory to the machine,
  2. unpersist RDDs you no longer need,
  3. manage memory more carefully (put some constraints on Spark's memory usage); see the sketch below.
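
For point 3, a rough sketch of the kind of constraints I mean (the values here are illustrative only; tune them to your cluster):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.executor.memory", "4g")           # JVM heap per executor
        .set("spark.python.worker.memory", "1g")      # memory per Python worker before spilling to disk
        .set("spark.storage.memoryFraction", "0.3"))  # shrink the JVM cache to leave room for Python
sc = SparkContext(conf=conf)

# ... and for point 2, drop cached RDDs as soon as you are done with them:
rdd.unpersist()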