My code is below:

package com.mirrtalk

import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.ConstantInputDStream

import scala.collection.mutable.ListBuffer

object Test {
  def main(args: Array[String]) {
    // master and app name are assumed to be supplied via spark-submit
    val sc = new SparkContext
    val sec = Seconds(3)
    val ssc = new StreamingContext(sc, sec)
    ssc.checkpoint("./checkpoint")
    val rdd = ssc.sparkContext.parallelize(Seq("a", "b", "c"))
    val inputDStream = new ConstantInputDStream(ssc, rdd)

    inputDStream.transform(rdd => {
      val buf = ListBuffer[String]()
      buf += "1"
      buf += "2"
      buf += "3"
      val other_rdd = ssc.sparkContext.parallelize(buf) // create a new RDD
      rdd.union(other_rdd)
    }).print()

    ssc.start()
    ssc.awaitTermination()
  }
}

and it throws this exception:

java.io.NotSerializableException: DStream checkpointing has been enabled but the DStreams with their functions are not serializable
org.apache.spark.streaming.StreamingContext
Serialization stack:
    - object not serializable (class: org.apache.spark.streaming.StreamingContext, value: org.apache.spark.streaming.StreamingContext@5626e185)
    - field (class: com.mirrtalk.Test$$anonfun$main$1, name: ssc$1, type: class org.apache.spark.streaming.StreamingContext)
    - object (class com.mirrtalk.Test$$anonfun$main$1, <function1>)
    - field (class: org.apache.spark.streaming.dstream.DStream$$anonfun$transform$1$$anonfun$apply$21, name: cleanedF$2, type: interface scala.Function1)
    - object (class org.apache.spark.streaming.dstream.DStream$$anonfun$transform$1$$anonfun$apply$21, <function2>)
    - field (class: org.apache.spark.streaming.dstream.DStream$$anonfun$transform$2$$anonfun$5, name: cleanedF$3, type: interface scala.Function2)
    - object (class org.apache.spark.streaming.dstream.DStream$$anonfun$transform$2$$anonfun$5, <function2>)
    - field (class: org.apache.spark.streaming.dstream.TransformedDStream, name: transformFunc, type: interface scala.Function2)

When I remove the line `ssc.checkpoint("./checkpoint")`, the application works fine, but I need checkpointing enabled.

How can I fix this issue with checkpointing enabled?


1 Answer


You can move context initialization and configuration tasks outside `main`:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.ConstantInputDStream

import scala.collection.mutable.ListBuffer

object App {
  val sc = new SparkContext(new SparkConf().setAppName("foo").setMaster("local"))
  val sec = Seconds(3)
  val ssc = new StreamingContext(sc, sec)
  ssc.checkpoint("./checkpoint") // enable checkpoint

  def main(args: Array[String]) {
    val rdd = ssc.sparkContext.parallelize(Seq("a", "b", "c"))
    val inputDStream = new ConstantInputDStream(ssc, rdd)

    inputDStream.transform(rdd => {
      val buf = ListBuffer[String]()
      buf += "1"
      buf += "2"
      buf += "3"
      val other_rdd = ssc.sparkContext.parallelize(buf)
      rdd.union(other_rdd) // I want to union other RDD
    }).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
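
A related, untested sketch (not part of the accepted fix): since `transform` receives each batch's RDD on the driver, another option is to avoid referencing `ssc` inside the closure entirely and reach the SparkContext through the RDD itself via `rdd.sparkContext`. The closure then captures nothing from the enclosing scope, so the checkpoint serializability check should have nothing non-serializable to drag in; treat this as an illustration rather than a verified alternative.

// Untested alternative sketch: no reference to ssc inside the transform closure.
inputDStream.transform { rdd =>
  val buf = ListBuffer("1", "2", "3")
  val otherRdd = rdd.sparkContext.parallelize(buf) // SparkContext taken from the incoming RDD
  rdd.union(otherRdd)
}.print()
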
  • Isn't the issue the fact that a `StreamingContext` can't be serialized, and he's using it inside a transformation? – Yuval Itzchakov Jul 22 '16 at 10:13
  • @YuvalItzchakov This was my first thought, but it is not used in the transformation (it is used only at the stream level), so it is not the direct issue. It looks like the problem is more subtle here, with the StreamingContext being dragged in during checkpointing. – zero323 Jul 22 '16 at 10:23
  • Is `transform` invoked on the driver side or the worker side? – Yuval Itzchakov Jul 22 '16 at 10:33
  • @YuvalItzchakov Driver side. And it [is using contexts internally anyway](https://github.com/apache/spark/blob/a95252823e09939b654dd425db38dadc4100bc87/streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala#L673-L683). – zero323 Jul 22 '16 at 10:37
  • Since `transform` takes an `RDD` as input, it logically cannot be run on the worker. You can only access an `RDD` on the driver. – Kien Truong Jul 22 '16 at 10:43
  • @zero323 Thank you very much! It works well now! But I don't understand the difference between doing the context initialization and configuration outside `main` versus inside `main`. – Guo Jul 22 '16 at 14:56
  • Is this approach suitable for a production environment? – Guo Jul 22 '16 at 15:12
  • @zero323 Could you tell me the difference between putting it outside `main` and inside `main`? I can't find any documents or posts describing this situation. I really want to know, please! – Guo Jul 23 '16 at 11:49
  • `main`, like any other function, is just an object, here of type `Function1[Array[String], Unit]`, so every value in its body is its member. In the second case it is a member of the enclosing object. – zero323 Jul 23 '16 at 12:15
  • @zero323 Thanks again! I think I understand now. – Guo Jul 24 '16 at 01:11
  • I have a very similar issue in Java code. I do have the Spark object initialized in a different class, not in the same class, but I still see this error. Please have a look: http://stackoverflow.com/questions/41747725/sparkstreaming-creating-rdd-and-doing-union-in-a-transform-operation-with-ssc-ch?noredirect=1#comment70688308_41747725 – tsar2512 Jan 19 '17 at 17:24
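
To illustrate the closure-capture point zero323 makes in the comments, here is a rough, Spark-free sketch (names like `ClosureCaptureDemo` and `Context` are invented for illustration). It assumes, as the accepted fix relies on, that a member of a top-level object is accessed statically and therefore not captured by a closure, while a local val defined inside `main` is captured as a field of the function object:

import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

class Context // stand-in for StreamingContext: not Serializable

object ClosureCaptureDemo {
  val outerCtx = new Context // member of the enclosing top-level object

  def main(args: Array[String]): Unit = {
    val innerCtx = new Context // local val inside main

    // References a module member: accessed statically, so nothing is captured
    // and the function serializes without dragging Context along.
    val usesOuter: Int => Int = x => { outerCtx.hashCode(); x + 1 }

    // Captures the local val as a field of the function object, so serializing
    // the function tries to serialize the non-serializable Context and fails.
    val usesInner: Int => Int = x => { innerCtx.hashCode(); x + 1 }

    def serialize(f: AnyRef): Unit =
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(f)

    serialize(usesOuter) // succeeds
    try serialize(usesInner)
    catch { case e: NotSerializableException => println(s"fails as expected: $e") }
  }
}

This mirrors why the accepted answer works: with `ssc` as a member of object `App`, the transform function no longer holds a field pointing at the StreamingContext, so DStream checkpointing has nothing non-serializable to write.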