I encountered the following exception: Exception in thread "main" java.io.NotSerializableException: DStream checkpointing has been enabled but the DStreams with their functions are not serializable

I have enabled checkpointing elsewhere, and I use this class to process the stream. The exception says that this class is not serializable:

class EventhubsStateTransformComponent(inStream: DStream[EventhubsEvent]) extends PipelineComponent with Logging {
    def process() = {
        inStream.foreachRDD(rdd => {
            if (rdd.isEmpty()) {
                logInfo("Extract outstream is empty...")
            } else {
                logInfo("Extract outstream is not empty...")
            }
        })
        // TODO eventhubsId is hardcoded
        val eventhubsId = "1"
        val statePairStream = inStream.map(eventhubsEvent => ((eventhubsId, eventhubsEvent.partitionId), eventhubsEvent.eventOffset))
        val eventhubsEventStateStream = statePairStream.mapWithState(StateSpec.function(EventhubsStreamState.updateStateFunc _))
        val snapshotStateStream = eventhubsEventStateStream.stateSnapshots()
        val out = snapshotStateStream.map(state => {
            (state._1._1, state._1._2, state._2, System.currentTimeMillis() / 1000)
        })
        outStream = out
    }
}

P.S. EventhubsEvent is a case class.

=======================================================

Edit: After I made this class extend Serializable, the exception disappeared. But I wonder in which cases we need to make our own classes extend Serializable. Does it mean that if a class contains a foreachRDD operation, checkpointing will validate the code and require the whole object containing that foreachRDD to be serializable? As I remember, in some cases only the objects referenced inside the foreachRDD scope need to be serializable.
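
For illustration, a minimal sketch of my understanding (BadComponent and GoodComponent are made-up names; EventhubsEvent and Logging are the types from my code above):

    import org.apache.spark.streaming.dstream.DStream

    // logInfo is an instance method, so the call compiles to this.logInfo(...)
    // and the closure keeps the whole object in its synthetic $outer field.
    class BadComponent(inStream: DStream[EventhubsEvent]) extends Logging {   // not Serializable
        def process() = {
            inStream.foreachRDD(rdd => {
                logInfo(s"count = ${rdd.count()}")   // captures `this` via $outer
            })
        }
    }

    // This closure references only a local val, so `this` is never captured
    // and nothing non-serializable ends up in the checkpointed DStream graph.
    class GoodComponent(inStream: DStream[EventhubsEvent]) {
        def process() = {
            val name = "GoodComponent"               // local, serializable
            inStream.foreachRDD(rdd => {
                println(s"$name count = ${rdd.count()}")
            })
        }
    }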

Serialization stack:
    - object not serializable (class: com.testdm.spark.streaming.etl.common.pipeline.EventhubsStateTransformComponent, value: com.testdm.spark.streaming.etl.common.pipeline.EventhubsStateTransformComponent@2a92a7fd)
    - field (class: com.testdm.spark.streaming.etl.common.pipeline.EventhubsStateTransformComponent$$anonfun$process$1, name: $outer, type: class com.testdm.spark.streaming.etl.common.pipeline.EventhubsStateTransformComponent)
    - object (class com.testdm.spark.streaming.etl.common.pipeline.EventhubsStateTransformComponent$$anonfun$process$1, <function1>)
    - field (class: org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3, name: cleanedF$1, type: interface scala.Function1)
    - object (class org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3, <function2>)
    - writeObject data (class: org.apache.spark.streaming.dstream.DStream)
    - object (class org.apache.spark.streaming.dstream.ForEachDStream, org.apache.spark.streaming.dstream.ForEachDStream@3e1cb83b)
    - element of array (index: 0)
    - array (class [Ljava.lang.Object;, size 16)
    - field (class: scala.collection.mutable.ArrayBuffer, name: array, type: class [Ljava.lang.Object;)
    - object (class scala.collection.mutable.ArrayBuffer, ArrayBuffer(org.apache.spark.streaming.dstream.ForEachDStream@3e1cb83b, org.apache.spark.streaming.dstream.ForEachDStream@46034134))
    - writeObject data (class: org.apache.spark.streaming.dstream.DStreamCheckpointData)
    - object (class org.apache.spark.streaming.dstream.DStreamCheckpointData, [
0 checkpoint files])
    - writeObject data (class: org.apache.spark.streaming.dstream.DStream)
    - object (class org.apache.spark.streaming.dstream.PluggableInputDStream, org.apache.spark.streaming.dstream.PluggableInputDStream@5066ad14)
    - writeObject data (class: org.apache.spark.streaming.dstream.DStreamCheckpointData)
    - object (class org.apache.spark.streaming.dstream.DStreamCheckpointData

    //....
Jaming LAM
  • Does `EventHub` contain any fields which may not be serializable? – Yuval Itzchakov Aug 25 '16 at 09:07
  • @YuvalItzchakov Hi, you mean EventhubsEvent? It is a nested case class with some primitive types and Scala Options such as Option[String], Option[] – Jaming LAM Aug 25 '16 at 09:15
  • Did you dig into the StackTrace? Spark tells you exactly which field is causing trouble – Yuval Itzchakov Aug 25 '16 at 09:24
  • @YuvalItzchakov Hi, I added a piece of the serialization stack info, but it seems hard to narrow down which field causes this problem – Jaming LAM Aug 25 '16 at 09:39
  • What is `EventhubsStateTransformComponent`? – Yuval Itzchakov Aug 25 '16 at 16:51
  • @YuvalItzchakov It is a class that handles some stream transformations. And I found that if I put foreachRDD inside this class, it seems to trigger checkpoint validation, which tries to serialize the class object and then throws this exception. I use foreachRDD in this class just for debugging, and I am confused about when we need to consider serialization problems in Spark Streaming and how to work around them. – Jaming LAM Aug 26 '16 at 04:45

1 Answer


From the serialization stack:

  • object not serializable (class: com.testdm.spark.streaming.etl.common.pipeline.EventhubsStateTransformComponent, value: com.testdm.spark.streaming.etl.common.pipeline.EventhubsStateTransformComponent@2a92a7fd)
  • field (class: com.testdm.spark.streaming.etl.common.pipeline.EventhubsStateTransformComponent$$anonfun$process$1, name: $outer, type: class com.testdm.spark.streaming.etl.common.pipeline.EventhubsStateTransformComponent)
  • object (class com.testdm.spark.streaming.etl.common.pipeline.EventhubsStateTransformComponent$$anonfun$process$1, <function1>)

The name entries show which object is not serializable. Here the culprit is the $outer field: the closure you pass to foreachRDD was compiled with a reference to its enclosing EventhubsStateTransformComponent instance, so Spark has to serialize the whole component and fails. Check where that closure uses members of the outer class. In general, keep non-serializable objects on the driver or create them inside the executor-side code; do not let them be captured by a function that is shipped from the driver to the executors.
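
For example, a rough sketch of the local-copy pattern applied to the code in the question (println stands in for the logInfo calls so the closure no longer touches `this`; componentName is a hypothetical field standing in for whatever state the component holds):

    def process() = {
        // Copy what the closure needs into a local val; the closure then
        // captures only this serializable String instead of `this`.
        val name = componentName   // hypothetical field on the component
        inStream.foreachRDD(rdd => {
            if (rdd.isEmpty()) println(s"$name: extract outstream is empty...")
            else println(s"$name: extract outstream is not empty...")
        })
    }

Alternatively, making EventhubsStateTransformComponent extend Serializable also works, as the question's edit found, but then every field of the component must be serializable too.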

Matiji66