Memory issues when running Spark job on relatively large input

Question

I am running a spark cluster with 50 machines. Each machine is a VM with 8-core, and 50GB memory (41 seems to be available to Spark).

I am running on several input folders, I estimate the size of input to be ~250GB gz compressed.

Although it seems to me that the amount and configuration of machines I am using seems to be sufficient, after about 40 minutes of run the job fail, I can see following errors in the logs:

2558733 [Result resolver thread-2] WARN org.apache.spark.scheduler.TaskSetManager  - Lost task 345.0 in stage 1.0 (TID 345, hadoop-w-3.c.taboola-qa-01.internal): java.lang.OutOfMemoryError: Java heap space

java.lang.StringCoding$StringDecoder.decode(StringCoding.java:149)
java.lang.StringCoding.decode(StringCoding.java:193)
java.lang.String.<init>(String.java:416)
java.lang.String.<init>(String.java:481)
com.doit.customer.dataconverter.Phase0$3.call(Phase0.java:699)
com.doit.customer.dataconverter.Phase0$3.call(Phase0.java:660)
org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$7$1.apply(JavaRDDLike.scala:164)
org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$7$1.apply(JavaRDDLike.scala:164)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)

and also:

2653545 [Result resolver thread-2] WARN org.apache.spark.scheduler.TaskSetManager  - Lost task 122.1 in stage 1.0 (TID 392, hadoop-w-22.c.taboola-qa-01.internal): java.lang.OutOfMemoryError: GC overhead limit exceeded

java.lang.StringCoding$StringDecoder.decode(StringCoding.java:149)
java.lang.StringCoding.decode(StringCoding.java:193)
java.lang.String.<init>(String.java:416)
java.lang.String.<init>(String.java:481)
com.doit.customer.dataconverter.Phase0$3.call(Phase0.java:699)
com.doit.customer.dataconverter.Phase0$3.call(Phase0.java:660)
org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$7$1.apply(JavaRDDLike.scala:164)
org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$7$1.apply(JavaRDDLike.scala:164)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)

How do I go about debugging such an issue?

EDIT: I Found the root cause of the problem. It is this piece of code:

    private static final int MAX_FILE_SIZE = 40194304;
    ....
    ....
        JavaPairRDD<String, List<String>> typedData = filePaths.mapPartitionsToPair(new PairFlatMapFunction<Iterator<String>, String, List<String>>() {
            @Override
            public Iterable<Tuple2<String, List<String>>> call(Iterator<String> filesIterator) throws Exception {
                List<Tuple2<String, List<String>>> res = new ArrayList<>();
                String fileType = null;
                List<String> linesList = null;
                if (filesIterator != null) {
                    while (filesIterator.hasNext()) {
                        try {
                            Path file = new Path(filesIterator.next());
                            // filter non-trc files
                            if (!file.getName().startsWith("1")) {
                                continue;
                            }
                            fileType = getType(file.getName());
                            Configuration conf = new Configuration();
                            CompressionCodecFactory compressionCodecs = new CompressionCodecFactory(conf);
                            CompressionCodec codec = compressionCodecs.getCodec(file);
                            FileSystem fs = file.getFileSystem(conf);
                            ContentSummary contentSummary = fs.getContentSummary(file);
                            long fileSize = contentSummary.getLength();
                            InputStream in = fs.open(file);
                            if (codec != null) {
                                in = codec.createInputStream(in);
                            } else {
                                throw new IOException();
                            }

                            byte[] buffer = new byte[MAX_FILE_SIZE];

                            BufferedInputStream bis = new BufferedInputStream(in, BUFFER_SIZE);
                            int count = 0;
                            int bytesRead = 0;
                            try {
                                while ((bytesRead = bis.read(buffer, count, BUFFER_SIZE)) != -1) {
                                    count += bytesRead;
                                }
                            } catch (Exception e) {
                                log.error("Error reading file: " + file.getName() + ", trying to read " + BUFFER_SIZE + " bytes at offset: " + count);
                                throw e;
                            }

                            Iterable<String> lines = Splitter.on("\n").split(new String(buffer, "UTF-8").trim());
                            linesList = Lists.newArrayList(lines);

                            // get rid of first line in file

                            Iterator<String> it = linesList.iterator();
                            if (it.hasNext()) {
                                it.next();
                                it.remove();
                            }
                            //res.add(new Tuple2<>(fileType,linesList));
                        } finally {
                            res.add(new Tuple2<>(fileType, linesList));
                        }


                    }

                }
                return res;
            }

Particularly allocating a buffer of size 40M for each file in order to read the content of the file using BufferedInputStream. This causes the stack memory to end at some point.

The thing is:

If I read line by line (which does not require a buffer), it will be very non-efficient read
If I allocate one buffer and reuse it for each file read - is it possible in parallelism sense? Or will it get overwritten by several threads?

Any suggestions are welcome...

EDIT 2: Fixed first memory issue by moving the byte array allocation outside the iterator, so it gets reused by all partition elements. But there is still the new String(buffer, "UTF-8").trim()) which gets created for the split purpose - that's an object that gets also created every time. I could use a stringbuffer/builder but then how would I set the charset encoding without a String object ?

It seems like you are reading the content of all input files into an in-memory ArrayList? This sort of defeats the purpose of working with RDDs/partitions, or am I missing something? — Eric Eijkelenboom, Oct 23 '14 at 17:57

score 1 · Answer 1 · answered Oct 24 '14 at 22:44

Eventually I changed the code as follows:

       // Transform list of files to list of all files' content in lines grouped by type
        JavaPairRDD<String,List<String>> typedData = filePaths.mapToPair(new PairFunction<String, String, List<String>>() {
            @Override
            public Tuple2<String, List<String>> call(String filePath) throws Exception {
                Tuple2<String, List<String>> tuple = null;
                try {
                    String fileType = null;
                    List<String> linesList = new ArrayList<String>();
                    Configuration conf = new Configuration();
                    CompressionCodecFactory compressionCodecs = new CompressionCodecFactory(conf);
                    Path path = new Path(filePath);
                    fileType = getType(path.getName());
                    tuple = new Tuple2<String, List<String>>(fileType, linesList);

                    // filter non-trc files
                    if (!path.getName().startsWith("1")) {
                        return tuple;
                    }

                    CompressionCodec codec = compressionCodecs.getCodec(path);
                    FileSystem fs = path.getFileSystem(conf);
                    InputStream in = fs.open(path);
                    if (codec != null) {
                        in = codec.createInputStream(in);
                    } else {
                        throw new IOException();
                    }

                    BufferedReader r = new BufferedReader(new InputStreamReader(in, "UTF-8"), BUFFER_SIZE);
                    // Get rid of the first line in the file
                    r.readLine();

                    // Read all lines
                    String line;
                    while ((line = r.readLine()) != null) {
                        linesList.add(line);
                    }
                } catch (IOException e) { // Filtering of files whose reading went wrong
                    log.error("Reading of the file " + filePath + " went wrong: " + e.getMessage());
                } finally {
                    return tuple;
                }
            }

        });

So now I do not use a buffer in size of 40M but rather build the lines list dynamically using an array list. This solved my current memory issue, but now I got other strange errors failing the job. Will report those in a different question...

Memory issues when running Spark job on relatively large input

1 Answers1