
I'd like to run a Spark Streaming application with Kafka as the data source. It works fine in local mode but fails on the cluster. I'm using Spark 1.6.2 and Scala 2.10.6.

Here are the source code and the stack trace.

DevMain.scala

object DevMain extends App with Logging {

1.  val lme: RawMetricsExtractor = new JsonExtractor[HttpEvent](props, topicArray)
2.  val broadcastLme = sc.broadcast(lme)
3.  val lines: DStream[MetricTypes.InputStreamType] = myConsumer.createDefaultStream()
4.  lines.foreachRDD { rdd =>
5.    if ((rdd != null) && (rdd.count() > 0) && (!rdd.isEmpty())) {
6.      logInfo("filteredLines: " + rdd.count())
7.      logInfo("start loop")
8.      val le = broadcastLme.value
9.      rdd.foreach(x => lme.aParser(x).get)
10.     logInfo("end loop")
11.   }
12. }
13. lines.print(10)

}

I'm getting a NullPointerException at line 6, and the code never enters lme.aParser.

This is lme.aParser:

class JsonExtractor[T <: SpecificRecordBase : Manifest](props: java.util.Properties, topicArray: Array[String])
  extends java.io.Serializable with RawMetricsExtractor with TitaniumConstants with Logging {

  def aParser(x: MetricTypes.InputStreamType): Option[MetricTypes.RawMetricEntryType] = {

    logInfo("jUtils: " + jUtils)
    logInfo("jFactory: " + jsonF)

    if (x == null) {
      logInfo("x is null: " + jUtils)
      return None
    }
    // ... rest of the method elided ...
  }
}

I have a log statement on the first line of lme.aParser, and it never gets printed, so execution never enters lme.aParser.
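Note that on line 9 the closure still captures lme directly rather than using the broadcast value le read on line 8. A minimal sketch of the loop reading the extractor through the broadcast handle inside the task (assuming broadcastLme from line 2) would be:

lines.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    rdd.foreach { x =>
      // Resolve the extractor through the broadcast handle on the executor;
      // the closure then captures only the handle, not lme itself.
      val le = broadcastLme.value
      le.aParser(x)
    }
  }
}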

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 11.0 failed 8 times, most recent failure: Lost task 0.7 in stage 11.0 (TID 118, dev-titanium-os-wcdc-spark-4.traxion.xfinity.tv): java.lang.NullPointerException
    at DevMain$$anonfun$4$$anonfun$apply$3.apply(DevMain.scala:6)
    at DevMain$$anonfun$4$$anonfun$apply$3.apply(DevMain.scala:6)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at org.apache.spark.util.NextIterator.foreach(NextIterator.scala:21)
    at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$32.apply(RDD.scala:912)
    at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$32.apply(RDD.scala:912)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
    at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:912)
    at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:910)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.foreach(RDD.scala:910)
    at DevMain$$anonfun$4.apply(DevMain.scala:6)
    at DevMain$$anonfun$4.apply(DevMain.scala:6)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:661)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:661)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:50)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
    at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:49)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
    at scala.util.Try$.apply(Try.scala:161)
    at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:224)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:224)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:224)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:223)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
    at DevMain$$anonfun$4$$anonfun$apply$3.apply(DevMain.scala:6)
    at DevMain$$anonfun$4$$anonfun$apply$3.apply(DevMain.scala:3)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at org.apache.spark.util.NextIterator.foreach(NextIterator.scala:21)
    at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$32.apply(RDD.scala:912)
    at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$32.apply(RDD.scala:912)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)

... 3 more

This is the new exception after adding the broadcast variable:

org.apache.spark.serializer.SerializationDebugger logWarning - Exception in serialization debugger
java.lang.NullPointerException
    at java.text.DateFormat.hashCode(DateFormat.java:739)
    at scala.collection.mutable.FlatHashTable$HashUtils$class.elemHashCode(FlatHashTable.scala:391)
    at scala.collection.mutable.HashSet.elemHashCode(HashSet.scala:41)
    at scala.collection.mutable.FlatHashTable$class.findEntryImpl(FlatHashTable.scala:123)
    at scala.collection.mutable.FlatHashTable$class.containsEntry(FlatHashTable.scala:119)
    at scala.collection.mutable.HashSet.containsEntry(HashSet.scala:41)
    at scala.collection.mutable.HashSet.contains(HashSet.scala:58)
    at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:87)
    at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:206)
    at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108)
    at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:206)
    at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108)
    at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:206)
    at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108)
    at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:206)
    at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108)
    at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:206)
    at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108)
    at org.apache.spark.serializer.SerializationDebugger$.find(SerializationDebugger.scala:67)
    at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
    at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:203)
    at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:102)
    at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:85)
    at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
    at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
    at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1326)
    at DevMain$delayedInit$body.apply(DevMain.scala:8)
    at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
    at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
    at scala.App$$anonfun$main$1.apply(App.scala:71)
    at scala.App$$anonfun$main$1.apply(App.scala:71)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
    at scala.App$class.main(App.scala:71)
    at DevMain$.<init>(DevMain.scala:17)
    at DevMain.main(DevMain.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:558)
[ERROR] 2016-12-26 18:01:23,039 org.apache.spark.deploy.yarn.ApplicationMaster logError - User class threw exception: java.io.NotSerializableException: com.fasterxml.jackson.module.scala.modifiers.SetTypeModifier$
java.io.NotSerializableException: com.fasterxml.jackson.module.scala.modifiers.SetTypeModifier$
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
    at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
    at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:203)
    at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:102)
    at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:85)
    at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
    at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
    at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1326)
    at .DevMain$delayedInit$body.apply(DevMain.scala:103)
    at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
    at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
    at scala.App$$anonfun$main$1.apply(App.scala:71)
    at scala.App$$anonfun$main$1.apply(App.scala:71)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
    at scala.App$class.main(App.scala:71)
    at DevMain$.main(DevMain.scala:17)
    at DevMain.main(DevMain.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:558)
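The NotSerializableException indicates that broadcasting lme Java-serializes the entire JsonExtractor object graph, including Jackson's Scala module (the SetTypeModifier$ above), which is not serializable. A common workaround is to keep the Jackson objects out of the serialized graph; a minimal sketch follows, assuming JsonExtractor builds its own Jackson mapper (the mapper field below is a hypothetical stand-in for however jUtils/jsonF are actually constructed):

import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

class JsonExtractor[T <: SpecificRecordBase : Manifest](props: java.util.Properties, topicArray: Array[String])
  extends java.io.Serializable with RawMetricsExtractor with TitaniumConstants with Logging {

  // @transient keeps the non-serializable Jackson objects out of the
  // serialized object graph; lazy rebuilds them on first use on each executor.
  @transient private lazy val mapper: ObjectMapper = {
    val m = new ObjectMapper()
    m.registerModule(DefaultScalaModule)
    m
  }

  // ... aParser and the rest unchanged ...
}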
  • where `lme` is defined (line#6) ? check `lme.aParser(x).get`, here `lme.aParser(x)` might be null – mrsrinivas Dec 26 '16 at 16:54
  • yeah `lme.aParser(x).get` is the cause I suppose , because this code will run on worker and you are not broadcasting it and hence it gives null pointer on worker. Try to broadcast this value and then use it accordingly ! – Shivansh Dec 26 '16 at 16:56
  • def aParser(x: MetricTypes.InputStreamType): Option[MetricTypes.RawMetricEntryType] = { logInfo("jUtils: " + jUtils) logInfo("jFactory: " + jsonF) if(x == null) { logInfo("x is null: " + jUtils) return None } } – user2359997 Dec 26 '16 at 16:57
  • this is the lme.aParser ... I have a log on line 1 that never gets printed ... – user2359997 Dec 26 '16 at 16:59
  • Can you show the code where you define `lme`? `(rdd != null)` and `rdd.count()` are not needed (and the latter triggers a Spark job, i.e. adds up to the execution time and makes your app slightly slower). – Jacek Laskowski Dec 26 '16 at 18:01
  • thanks for your comment will surely remove them – user2359997 Dec 26 '16 at 18:07
  • I added the code where lme is defined – user2359997 Dec 26 '16 at 18:11

1 Answer


Yes, lme.aParser(x).get is the cause, I suppose: this code runs on the workers, and since you are not broadcasting the lme object, it gives a null pointer on the workers.

Try broadcasting this value and then use it accordingly.

Something like this should work:

val broadcastLme = sc.broadcast(lme)
val lines: DStream[MetricTypes.InputStreamType] = myConsumer.createDefaultStream()

lines.foreachRDD { rdd =>
  if ((rdd != null) && (rdd.count() > 0) && (!rdd.isEmpty())) {
    logInfo("filteredLines: " + rdd.count())
    logInfo("start loop")
    rdd.foreach { x =>
      val lme = broadcastLme.value
      lme.aParser(x).get
    }
    logInfo("end loop")
  }
}

lines.print(10)
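Note that broadcastLme.value is read inside the rdd.foreach closure, so the task serializes only the small broadcast handle; lme itself still has to be Java-serializable for the broadcast write to succeed, which is what the follow-up serialization errors in the comments below are about.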
  • How can you know if `lme` can be broadcast? What if it uses non-Serializable objects? Without knowing more about `lme` I'd hardly recommend a broadcast variable. – Jacek Laskowski Dec 26 '16 at 18:02
  • Thanks Srivastava for pointing me in the right direction ... after adding the broadcast variable I'm getting serialization errors – user2359997 Dec 26 '16 at 18:13
  • Jacek could you please point me how to overcome the serialization exceptions – user2359997 Dec 26 '16 at 18:34
  • @user2359997: Can you please point out how your lme looks like! is it a serializable object ! – Shivansh Dec 27 '16 at 04:55
  • @ShivanshSrivastava could you please point how would i pass the broadcast variable if instead of foreachRdd we want to use lines.map(lme.parser) – user2359997 Dec 27 '16 at 06:25
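
For the lines.map variant asked about in the last comment, the same pattern applies: read the broadcast inside the function passed to map (a sketch, assuming the broadcastLme handle from the answer above):

val parsed: DStream[Option[MetricTypes.RawMetricEntryType]] =
  lines.map { x =>
    // broadcastLme.value resolves on the executor, so the closure captures
    // only the broadcast handle rather than lme itself.
    broadcastLme.value.aParser(x)
  }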