
I am running a job on Spark 1.2.0 in standalone mode.

The first operation I am doing is taking an RDD of folder paths and generating an RDD of file names, composed of the files residing in each folder:

JavaRDD<String> filePaths = paths.mapPartitions(new FoldersToFiles()).repartition(defaultPartitions);

where the inner implementation of the FoldersToFiles class is:

@Override
public Iterable<String> call(Iterator<String> pathsIterator) throws Exception {
    List<String> filesPath = new ArrayList<String>();
    if (pathsIterator != null) {
        while (pathsIterator.hasNext()) {
            try {
                String currFolder = pathsIterator.next();
                Path currPath = new Path(currFolder);
                FileSystem fs = FileSystem.get(currPath.toUri(), new Configuration(true));
                FileStatus[] files = fs.listStatus(currPath);
                List<FileStatus> filesList = Arrays.asList(files);
                List<String> filesPathsStr = new Utils().convertFileStatusToPath(filesList);
                filesPath.addAll(filesPathsStr);
            } catch (Exception e) {
                log.error("Error during file names extraction: " + e.getMessage());
            }
        }
    }
    if (filesPath == null || filesPath.isEmpty()) {
        log.error("Warning: files path list is null or empty!! Given Path Iterator is: " + pathsIterator.toString());
    }
    return filesPath;
}
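
For completeness, FoldersToFiles implements Spark's FlatMapFunction (in Spark 1.2, JavaRDD.mapPartitions takes a FlatMapFunction<Iterator<T>, U> whose call() returns an Iterable<U>, and the function must be serializable). The declaration is along these lines, skeleton only, body elided to the snippet above:

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.spark.api.java.function.FlatMapFunction;

// Skeleton of the declaration implied by the mapPartitions call site.
// FlatMapFunction extends java.io.Serializable, so this class and
// everything it captures must be serializable.
public class FoldersToFiles implements FlatMapFunction<Iterator<String>, String> {
    @Override
    public Iterable<String> call(Iterator<String> pathsIterator) throws Exception {
        // full body as shown above; it always returns a List<String>,
        // possibly empty, never null
        List<String> filesPath = new ArrayList<String>();
        return filesPath;
    }
}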

When running the job on the cluster, I get the following error:

520983 [task-result-getter-1] WARN org.apache.spark.scheduler.TaskSetManager  - Lost task 33.0 in stage 1.0 (TID 1033, hadoop-w-8.c.taboola-qa-01.internal): java.lang.NullPointerException
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$4$1.apply(JavaRDDLike.scala:140)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$4$1.apply(JavaRDDLike.scala:140)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

So the error is not directly inside my code. However, looking at the relevant line in the Spark code:

  /**
   * Return a new RDD by applying a function to each partition of this RDD.
   */
  def mapPartitions[U](f: FlatMapFunction[java.util.Iterator[T], U]): JavaRDD[U] = {
    def fn = (x: Iterator[T]) => asScalaIterator(f.call(asJavaIterator(x)).iterator())
    JavaRDD.fromRDD(rdd.mapPartitions(fn)(fakeClassTag[U]))(fakeClassTag[U])
  }

(line 140, where the exception happens, is the first line of the method body, i.e. the def fn line)

It is probably related to the code I mentioned above (and indeed this is the first mapPartitions in my job, so that makes sense), but I cannot understand why.
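
For what it's worth, the only things dereferenced on that line are f itself and the Iterable returned by f.call. A function whose call() returned null would produce a NullPointerException at exactly that line, as in the hypothetical sketch below, but my call() always returns a (possibly empty) list:

import java.util.Iterator;

import org.apache.spark.api.java.function.FlatMapFunction;

// Hypothetical function, for illustration only: Spark wraps it as
// f.call(...).iterator(), so the null return below would throw a
// NullPointerException at JavaRDDLike.scala:140, inside Spark rather
// than in user code.
public class NullReturningFn implements FlatMapFunction<Iterator<String>, String> {
    @Override
    public Iterable<String> call(Iterator<String> it) throws Exception {
        return null; // NPE fires when Spark invokes .iterator() on this
    }
}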

Yaniv Donenfeld

1 Answer


Just a hunch: maybe the FoldersToFiles class needs to be declared static (if it is a private class)?
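
To illustrate the difference (class names here are hypothetical, not the asker's code): a non-static inner class drags a reference to its enclosing instance into serialization, while a static nested class is serialized on its own.

import java.util.Arrays;
import java.util.Iterator;

import org.apache.spark.api.java.function.FlatMapFunction;

public class MyJob { // hypothetical enclosing class, for illustration only

    // Non-static inner class: it keeps a hidden reference to the
    // enclosing MyJob instance, so Spark must serialize MyJob as well,
    // which can fail or misbehave on the executors.
    public class InnerFolderFn implements FlatMapFunction<Iterator<String>, String> {
        @Override
        public Iterable<String> call(Iterator<String> it) throws Exception {
            return Arrays.asList("from inner");
        }
    }

    // Static nested class: no hidden reference, serialized on its own;
    // the usual pattern for Spark functions defined inside another class.
    public static class StaticFolderFn implements FlatMapFunction<Iterator<String>, String> {
        @Override
        public Iterable<String> call(Iterator<String> it) throws Exception {
            return Arrays.asList("from static");
        }
    }
}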

pzecevic
  • Nope. It is a public class. I have many other such classes working successfully in the same project... – Yaniv Donenfeld Jan 15 '15 at 12:58
  • Well, can you expand that line into a block so that you can see exactly where NullPointer occurs? – pzecevic Jan 15 '15 at 13:17
  • One more thing: you check if (pathsIterator != null), but then call pathsIterator.toString() afterwards. – pzecevic Jan 15 '15 at 13:18
  • And another possibility: f.call can return a null, but you call iterator() on the result. – pzecevic Jan 15 '15 at 13:19
  • Note that the bottom code block is not Yaniv's code, it's in the Spark library. `f.call` is Yaniv's code (seen above), and I don't see how it could return `null`. – Daniel Darabos Jan 15 '15 at 14:49
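
Following up on the null-check comment: the guard in the question protects the while loop but not the final log statement. A defensive sketch of that tail (not the original code) would be:

// Sketch of a safer version of the final check in call():
// filesPath is created locally above, so it can never be null here,
// and pathsIterator must not be dereferenced when it may be null.
if (filesPath.isEmpty()) {
    log.error("Warning: files path list is empty! Given path iterator is: "
            + (pathsIterator == null ? "null" : pathsIterator.toString()));
}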