
I have enabled checkpointing, with the checkpoint data saved to S3. If there are NO files in the checkpoint directory, Spark Streaming works fine and I can see checkpoint files appearing in the directory. Then I kill Spark Streaming and restart it. This time I start getting a NullPointerException on the Spark session. In short: with no checkpoint files in the directory, Spark Streaming works fine, but as soon as I restart it with checkpoint files already present, I get a NullPointerException on the Spark session. Below is the code:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object asf {
  val microBatchInterval = 5
  val sparkSession = SparkSession
    .builder()
    .appName("Streaming")
    .getOrCreate()

  val conf = new SparkConf(true)
  //conf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
  val sparkContext = SparkContext.getOrCreate(conf)

  val checkpointDirectory = "s3a://bucketname/streaming-checkpoint"

  println("Spark session: " + sparkSession)

  val ssc = StreamingContext.getOrCreate(checkpointDirectory,
    () => {
      createStreamingContext(sparkContext, microBatchInterval, checkpointDirectory, sparkSession)
    }, s3Config.getConfig())

  ssc.start()
  ssc.awaitTermination()

  def createStreamingContext(sparkContext: SparkContext, microBatchInterval: Int,
                             checkpointDirectory: String, spark: SparkSession): StreamingContext = {
    println("Spark session inside: " + spark)
    val ssc = new StreamingContext(sparkContext, Seconds(microBatchInterval))
    //TODO: StorageLevel.MEMORY_AND_DISK_SER
    val lines = ssc.receiverStream(new EventHubClient(StorageLevel.MEMORY_AND_DISK_SER))
    lines.foreachRDD { rdd =>
      val df = spark.read.json(rdd)
      df.show()
    }
    ssc.checkpoint(checkpointDirectory)
    ssc
  }
}

And again, the very first time I run this code (with NO checkpoint files in the directory), I can see the DataFrame being printed out. But when I run it with checkpoint files present, I don't even see

println("Spark session inside: " + spark)

getting printed, even though it IS printed the first time. The error:

Exception in thread "main" java.lang.NullPointerException
    at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:111)
    at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)
    at org.apache.spark.sql.DataFrameReader.<init>(DataFrameReader.scala:549)
    at org.apache.spark.sql.SparkSession.read(SparkSession.scala:605)

And the error is happening at:

val df = spark.read.json(rdd)

Edit: I added this line:

conf.set("spark.streaming.stopGracefullyOnShutdown","true")

and it still made no difference; I am still getting the NullPointerException.

Ahmed
  • Did you change the code between runs? You cannot change the code when checkpointing. If that is the case, see the related Spark documentation. You need to gracefully shut down, and delete or change the checkpoint dir – Michel Lemay Sep 13 '17 at 02:54
  • Every time I ran the job for the first time, I would first empty s3a://bucketname/streaming-checkpoint. Then I would press Ctrl+C to shut down Spark Streaming, start it up again, and I would then get the NullPointerException. I used the same code between the runs – Ahmed Sep 13 '17 at 17:30

2 Answers


To answer my own question, this works:

lines.foreachRDD { rdd =>
  val sqlContext: SQLContext =
    SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate().sqlContext

  val df = sqlContext.read.json(rdd)
  df.show()
}

Building the SparkSession from rdd.sparkContext.getConf inside foreachRDD, instead of passing in the one created on the driver, works.
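
For context, a minimal sketch of how createStreamingContext from the question might look with this change; the driver-side SparkSession parameter is no longer needed, and EventHubClient is the asker's own receiver:

def createStreamingContext(sparkContext: SparkContext, microBatchInterval: Int,
                           checkpointDirectory: String): StreamingContext = {
  val ssc = new StreamingContext(sparkContext, Seconds(microBatchInterval))
  val lines = ssc.receiverStream(new EventHubClient(StorageLevel.MEMORY_AND_DISK_SER))

  lines.foreachRDD { rdd =>
    // Rebuild (or fetch) the session from the RDD's SparkContext on every batch,
    // instead of closing over the session that existed before checkpoint recovery.
    val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
    val df = spark.read.json(rdd)
    df.show()
  }

  ssc.checkpoint(checkpointDirectory)
  ssc
}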

Ahmed
  • Indeed, that explains a lot... when your code runs on an executor, it does not have access to the val `spark`. I'm surprised you didn't get a "Task not serializable" exception, though. – Michel Lemay Sep 14 '17 at 11:18
  • To dig further: `sessionState` is a transient lazy val. That means there is a slot reserved for the initialization of the variable on each executor; it does not share the same instance as the one on the driver. Since it is lazy, it gets initialized on first use, which happens to be when you load back your checkpoint. At that point it tries to use another val, `parentSessionState`, which is null on the executors because it is also transient. – Michel Lemay Sep 14 '17 at 11:25
  • The problem actually arises when I add another line of processing by a secondary class after sqlContext.read.json. For example, I instantiate this secondary class outside the foreachRDD, and one of the parameters it takes upon instantiation is the sparkSession. And in this secondary class, I use the sparkSession to do some spark.sql and such. This basically implies I must instantiate this secondary class every time within foreachRDD to get the Spark session from the RDD (one way around this is sketched right after these comments)... – Ahmed Sep 14 '17 at 21:35
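
One way around this last point is to give the secondary class its SparkSession per batch, through a method parameter, instead of at construction time, so it can still be instantiated once outside foreachRDD. A rough sketch, where Processor is a made-up name and lines is the DStream from the question:

import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper: it no longer captures a SparkSession when constructed,
// so the instance created on the driver stays valid across checkpoint restarts.
class Processor extends Serializable {
  def process(spark: SparkSession, df: DataFrame): Unit = {
    df.createOrReplaceTempView("events")
    spark.sql("SELECT * FROM events").show()
  }
}

val processor = new Processor

lines.foreachRDD { rdd =>
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
  val df = spark.read.json(rdd)
  processor.process(spark, df)
}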

Just to put it explicitly for the benefit of newbies: this is an anti-pattern. Creating a Dataset inside a transformation is not allowed!

As Michel mentioned, the executor won't have access to the SparkSession.
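
To make the mechanism Michel describes concrete, here is a rough, Spark-free sketch (Parent, Holder and TransientLazyDemo are made-up names) of how a @transient lazy val that depends on another @transient field behaves: it works on the original instance, but blows up after serialization and deserialization, which is roughly what happens to SparkSession.sessionState when the closure is restored from the checkpoint. (Exact recomputation behaviour of @transient lazy vals can vary a little between Scala versions.)

import java.io._

class Parent extends Serializable {
  def name: String = "parent"
}

class Holder(@transient val parent: Parent) extends Serializable {
  // Like sessionState: lazily computed from a @transient reference (cf. parentSessionState).
  @transient lazy val parentName: String = parent.name
}

object TransientLazyDemo {
  // Serialize and deserialize an object, simulating a checkpoint write + recovery.
  def roundTrip[T](obj: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    in.readObject().asInstanceOf[T]
  }

  def main(args: Array[String]): Unit = {
    val original = new Holder(new Parent)
    println(original.parentName)   // prints "parent" on the original instance

    val copy = roundTrip(original)
    println(copy.parentName)       // parent was not serialized (transient), so this fails with NullPointerException
  }
}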

Jeevan