
I have `words`, which is an `org.apache.spark.rdd.RDD[Array[String]]` (`MapPartitionsRDD[11] at map`) and looks like

Array(Array(cyber crimes, cyber security, review, india, instances, state, issue), Array(civil society, instances, frequency))
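For reproduction, an RDD of the same shape can be created like this (a sketch only; my actual `words` comes from a `map` over other data):

```scala
// Hypothetical reconstruction of the data shown above, for reproduction only.
val words: org.apache.spark.rdd.RDD[Array[String]] = sc.parallelize(Seq(
  Array("cyber crimes", "cyber security", "review", "india", "instances", "state", "issue"),
  Array("civil society", "instances", "frequency")
))
```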

After performing `flatMap` and `distinct` on the above to get all distinct words from the RDD, I get

scala> val uniquewords = words.flatMap(_.distinct)
uniquewords: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[20] at flatMap at <console>:30

scala> uniquewords.take(10)
res18: Array[String] = Array(cyber crimes, cyber security, review, india, instances, state, issue, civil society, frequency)

Now, when I perform `zipWithIndex` on it, I get an error:

scala> uniquewords.zipWithIndex
17/05/07 09:40:09 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 17)
java.lang.NullPointerException
    at $line16.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:27)
    at $line16.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:27)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1760)
    at org.apache.spark.rdd.ZippedWithIndexRDD$$anonfun$2.apply(ZippedWithIndexRDD.scala:52)
    at org.apache.spark.rdd.ZippedWithIndexRDD$$anonfun$2.apply(ZippedWithIndexRDD.scala:52)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
17/05/07 09:40:09 WARN TaskSetManager: Lost task 0.0 in stage 14.0 (TID 17, localhost, executor driver): java.lang.NullPointerException
    ... (same stack trace as above)

17/05/07 09:40:09 ERROR TaskSetManager: Task 0 in stage 14.0 failed 1 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 14.0 failed 1 times, most recent failure: Lost task 0.0 in stage 14.0 (TID 17, localhost, executor driver): java.lang.NullPointerException
    at $anonfun$1.apply(<console>:27)
    at $anonfun$1.apply(<console>:27)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1760)
    at org.apache.spark.rdd.ZippedWithIndexRDD$$anonfun$2.apply(ZippedWithIndexRDD.scala:52)
    at org.apache.spark.rdd.ZippedWithIndexRDD$$anonfun$2.apply(ZippedWithIndexRDD.scala:52)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
  at org.apache.spark.rdd.ZippedWithIndexRDD.<init>(ZippedWithIndexRDD.scala:50)
  at org.apache.spark.rdd.RDD$$anonfun$zipWithIndex$1.apply(RDD.scala:1293)
  at org.apache.spark.rdd.RDD$$anonfun$zipWithIndex$1.apply(RDD.scala:1293)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
  at org.apache.spark.rdd.RDD.zipWithIndex(RDD.scala:1292)
  ... 48 elided
Caused by: java.lang.NullPointerException
  ... (same executor stack trace as above)

My problem is almost similar to this one, but the solution there does not seem applicable to my case. Is there a different way to handle a MapPartitionsRDD?


1 Answer


Where did the MapPartitionsRDD come from? The following works without any problems:

val rdd = sc.parallelize(Array(Array("cyber", "india", "fourteen"), Array("crime", "india", "twelve")))
rdd.flatMap(_.distinct).zipWithIndex.collect

Array((cyber,0), (india,1), (fourteen,2), (crime,3), (india,4), (twelve,5))
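Note in passing that `flatMap(_.distinct)` only de-duplicates within each inner array, which is why `india` appears twice above. If you actually want globally distinct words, a sketch like this would do it:

```scala
// Flatten first, then de-duplicate across the whole RDD.
// Note: distinct shuffles the data, so the resulting order is not guaranteed.
rdd.flatMap(x => x).distinct.zipWithIndex.collect
```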

So there has to be something else at play here. Can you create a minimal working example that reproduces the error? My guess is that there are some empty rows in your RDD that you should be filtering away; that was always the case when I encountered a similar error. Those empty rows are producing the `NullPointerException` (I think), probably from trying to call `.distinct` on them. The error comes from an `$anonfun` frame, which means it is thrown inside an anonymous function you're passing into a `map` or `flatMap`; it's difficult to say exactly which one, as yours is not a complete example.

Double check your data ingestion and verify that the RDD contains what you think it contains.
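As a concrete starting point, here is a sketch of that filtering, using the `words` RDD from the question (an `RDD[Array[String]]`) and assuming the offending rows are null or empty arrays left over from whatever conversion produced it:

```scala
// Count suspicious rows first, to confirm the diagnosis
// (the null check must come first to avoid an NPE inside the filter itself).
val badRows = words.filter(arr => arr == null || arr.isEmpty).count()

// Drop those rows before flattening and indexing.
val cleaned = words.filter(arr => arr != null && arr.nonEmpty)
val indexed = cleaned.flatMap(_.distinct).zipWithIndex
```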

Matti Lyra
  • Thanks, I was able to perform the operation. There were some empty rows in my `RDD` (which I got by converting a `DataFrame` to an `RDD`). Can you please elaborate on why it couldn't handle empty rows? – analyticalpicasso May 07 '17 at 10:54
  • I'm not sure what I should elaborate on; it's the empty rows that are producing the `NullPointerException`. Why they are there depends very much on how your data preprocessing works. For me these were HTML files with only images on them, and my NLP pipeline would just return an empty result in those cases; this is extremely painful to debug in Spark, though. – Matti Lyra May 07 '17 at 11:06
  • Yeah, I got it after posting the comment. Sorry for asking such a silly question. Thanks :) – analyticalpicasso May 07 '17 at 11:27