I am running a job on Spark 1.2.0 in standalone mode.
The first operation in the job takes an RDD of folder paths and generates an RDD of file names, composed of the files that reside in each folder:
JavaRDD<String> filePaths = paths.mapPartitions(new FoldersToFiles()).repartition(defaultPartitions);
where the implementation inside the FoldersToFiles class is:
@Override
public Iterable<String> call(Iterator<String> pathsIterator) throws Exception {
    List<String> filesPath = new ArrayList<String>();
    if (pathsIterator != null) {
        while (pathsIterator.hasNext()) {
            try {
                String currFolder = pathsIterator.next();
                Path currPath = new Path(currFolder);
                FileSystem fs = FileSystem.get(currPath.toUri(), new Configuration(true));
                FileStatus[] files = fs.listStatus(currPath);
                List<FileStatus> filesList = Arrays.asList(files);
                List<String> filesPathsStr = new Utils().convertFileStatusToPath(filesList);
                filesPath.addAll(filesPathsStr);
            } catch (Exception e) {
                log.error("Error during file names extraction: " + e.getMessage());
            }
        }
    }
    if (filesPath == null || filesPath.isEmpty()) {
        log.error("Warning: files path list is null or empty!! Given Path Iterator is: " + pathsIterator.toString());
    }
    return filesPath;
}
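For reference, the function can also be exercised directly on the driver, completely outside Spark. The sketch below uses hypothetical folder paths (placeholders, not my real input); it only illustrates that, as written, call always returns a non-null (possibly empty) list:

import java.util.Arrays;
import java.util.List;

public class FoldersToFilesSmokeTest {
    public static void main(String[] args) throws Exception {
        // Placeholder folders standing in for the contents of the real paths RDD
        List<String> sampleFolders = Arrays.asList(
                "hdfs:///tmp/sample-folder-a",
                "hdfs:///tmp/sample-folder-b");

        // Invoke the same method Spark would invoke per partition
        Iterable<String> files = new FoldersToFiles().call(sampleFolders.iterator());

        // filesPath is always instantiated in call(), so the Iterable itself
        // should never be null, even if every folder is empty or unreadable
        for (String f : files) {
            System.out.println(f);
        }
    }
}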
When running the job on the cluster, I get the following error:
520983 [task-result-getter-1] WARN org.apache.spark.scheduler.TaskSetManager - Lost task 33.0 in stage 1.0 (TID 1033, hadoop-w-8.c.taboola-qa-01.internal): java.lang.NullPointerException
    at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$4$1.apply(JavaRDDLike.scala:140)
    at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$4$1.apply(JavaRDDLike.scala:140)
    at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
    at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:56)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
So the error is not directly inside my code. However, looking at the relevant lines in the Spark source:
/**
 * Return a new RDD by applying a function to each partition of this RDD.
 */
def mapPartitions[U](f: FlatMapFunction[java.util.Iterator[T], U]): JavaRDD[U] = {
  def fn = (x: Iterator[T]) => asScalaIterator(f.call(asJavaIterator(x)).iterator())
  JavaRDD.fromRDD(rdd.mapPartitions(fn)(fakeClassTag[U]))(fakeClassTag[U])
}
(line 140, where the exception is thrown, is the first line of the method body, i.e. the def fn line)
It is probably related to the code I mentioned above (this is actually the first mapPartitions call in my job, so that makes sense), but I cannot understand why.
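For illustration only (this is not my actual code): if I read JavaRDDLike.scala line 140 correctly, the only things dereferenced there are f itself and the Iterable returned from call, so a partition function that returned null would fail on exactly that line. A minimal hypothetical sketch against a local SparkContext and the Spark 1.2.0 Java API:

import java.util.Arrays;
import java.util.Iterator;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;

public class NullIterableRepro {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setMaster("local[2]").setAppName("npe-repro"));
        JavaRDD<String> paths = sc.parallelize(Arrays.asList("a", "b", "c"), 2);

        // Returning null instead of an (empty) Iterable means Spark's wrapper
        // calls .iterator() on null at JavaRDDLike.scala:140
        JavaRDD<String> broken = paths.mapPartitions(
                new FlatMapFunction<Iterator<String>, String>() {
                    @Override
                    public Iterable<String> call(Iterator<String> it) {
                        return null;
                    }
                });

        broken.collect(); // triggers the job, and with it the NullPointerException
        sc.stop();
    }
}

My call implementation above always returns a (possibly empty) ArrayList, though, so I still don't see how a null Iterable could come out of it.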