
I need to use a previously trained machine learning model to make predictions. However, I need to make the predictions inside foreachRDD, because the input data vecTest is passed through various transformations and if-then rules. To avoid serialization issues, I tried broadcasting the model. My code is given below. Nevertheless, I still get the serialization error. Any help is highly appreciated.

import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel

// Load the model once on the driver and broadcast it to the executors
val model = GradientBoostedTreesModel.load(sc, pathToModel)
val model_sc = sc.broadcast(model)

myDSTREAM.foreachRDD(rdd => {
  rdd.foreachPartition({ partitionOfRecords =>
     //...
     val prediction_result = model_sc.value.predict(vecTest)
  })
})
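
To rule out the model itself, a quick sanity check of the same broadcast pattern on a static RDD (run in spark-shell, where no enclosing class gets captured) could look like the sketch below; the test vector is a placeholder and has to match the model's feature dimension:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel

// Sketch: broadcast the model and predict per partition on a plain RDD.
// If this works, the streaming failure comes from what the streaming
// closure captures, not from the model itself.
val model = GradientBoostedTreesModel.load(sc, pathToModel)
val modelBC = sc.broadcast(model)

val testVectors = sc.parallelize(Seq(Vectors.dense(0.0, 1.0))) // placeholder data
testVectors.foreachPartition { partition =>
  partition.foreach { vec =>
    println(modelBC.value.predict(vec)) // predict(Vector) returns a Double
  }
}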

UPDATE:

I tried using Kryo serialization, but without success.

val conf = new SparkConf().setMaster(...).setAppName(...)
conf.registerKryoClasses(Array(classOf[GradientBoostedTreesModel]))
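
From what I have read since, this setting probably cannot help here: Kryo applies to data serialization (shuffles, caching, broadcast values), while Spark serializes task closures with plain Java serialization, so a Task not serializable error raised by the ClosureCleaner is unaffected. For completeness, a self-contained version of this configuration (master and app name are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel

val conf = new SparkConf()
  .setMaster("local[2]")   // placeholder
  .setAppName("Predictor") // placeholder
// registerKryoClasses also switches spark.serializer to KryoSerializer,
// but task closures are still serialized with Java serialization
conf.registerKryoClasses(Array(classOf[GradientBoostedTreesModel]))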

UPDATE 2:

If I run this code, I get the error (see the stack trace below):

myDSTREAM.foreachRDD(rdd => {
  rdd.foreachPartition({ partitionOfRecords =>
     // sc is the SparkContext, which exists only on the driver
     val model = GradientBoostedTreesModel.load(sc, pathToModel)
     partitionOfRecords.foreach(s => {
        //...
        val vecTestRDD = sc.parallelize(Seq(vecTest))
        val prediction_result = model.predict(vecTestRDD)
     })
  })
})

17/03/17 13:11:00 ERROR JobScheduler: Error running job streaming job 1489752660000 ms.0
org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:919)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:918)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:918)
    at org.test.classifier.Predictor$$anonfun$run$2.apply(Predictor.scala:210)
    at org.test.classifier.Predictor$$anonfun$run$2.apply(Predictor.scala:209)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:661)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:661)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:50)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
    at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:49)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
    at scala.util.Try$.apply(Try.scala:161)
    at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:224)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:224)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:224)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:223)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.NotSerializableException: org.test.classifier.Predictor
Serialization stack:
    - object not serializable (class: org.test.classifier.Predictor, value: org.test.classifier.Predictor@26e949f7)
    - field (class: org.test.classifier.Predictor$$anonfun$run$2, name: $outer, type: class org.test.classifier.Predictor)
    - object (class org.test.classifier.Predictor$$anonfun$run$2, <function1>)
    - field (class: org.test.classifier.Predictor$$anonfun$run$2$$anonfun$apply$4, name: $outer, type: class org.test.classifier.Predictor$$anonfun$run$2)
    - object (class org.test.classifier.Predictor$$anonfun$run$2$$anonfun$apply$4, <function1>)
    at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
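
Reading the Serialization stack above: the direct cause is not the model but the enclosing class. The anonymous functions created in Predictor.run keep a $outer reference to the Predictor instance, so Spark tries to serialize the whole non-serializable object. A second, independent problem in the snippet above is that sc is used inside foreachPartition: the SparkContext exists only on the driver and is never serializable. A sketch of how both issues might be avoided, reusing the broadcast model_sc from the first snippet (the construction of vecTest below is a placeholder for the real transformations and if-then rules):

// Copy everything the closures need into local vals, so the lambdas
// capture these values instead of `this` (the non-serializable Predictor)
val localModel = model_sc
myDSTREAM.foreachRDD(rdd => {
  rdd.foreachPartition({ partitionOfRecords =>
     partitionOfRecords.foreach(s => {
        // placeholder: the real code derives vecTest from s
        val vecTest = Vectors.parse(s)
        // predict(Vector) returns a Double, so no sc.parallelize is needed
        val prediction_result = localModel.value.predict(vecTest)
     })
  })
})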

UPDATE 3:

I tried another approach, but ran into the same problem:

  // Note: `model` is created in the driver scope and captured by the map closure below
  val model = GradientBoostedTreesModel.load(sc, mySet.value("modelAddress") + mySet.value("modelId"))
  val new_dstream = myDStream.map(session => {
    val features: Array[String] = UtilsPredictor.getFeatures()
    val parsedSession = UtilsPredictor.parseJSON(session)
    // build the feature string "[f1,f2,...,fn]"
    var input: String = ""
    var count: Int = 1
    for (i <- 0 until features.length) {
      if (count < features.length) {
        input += parsedSession(features(i)) + ","
        count += 1
      }
      else {
        input += parsedSession(features(i))
      }
    }
    input = "[" + input + "]"
    val vecTest = Vectors.parse(input)
    parsedSession + ("prediction_result" -> model.predict(vecTest).toString)
  })
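
Presumably the same capture problem applies here: model lives in the enclosing scope, so the map closure drags in its owner. A sketch of this variant with the model broadcast and captured through a local val; building the feature string with mkString produces the same "[f1,...,fn]" value as the loop above:

  val model = GradientBoostedTreesModel.load(sc, mySet.value("modelAddress") + mySet.value("modelId"))
  val modelBC = sc.broadcast(model)
  val localModelBC = modelBC // local val: the closure captures this, not the enclosing object

  val new_dstream = myDStream.map(session => {
    val parsedSession = UtilsPredictor.parseJSON(session)
    val input = UtilsPredictor.getFeatures().map(parsedSession(_)).mkString("[", ",", "]")
    val vecTest = Vectors.parse(input)
    parsedSession + ("prediction_result" -> localModelBC.value.predict(vecTest).toString)
  })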
Dinosaurius
  • If you want to broadcast your model, then it should be serializable. – Hlib Mar 16 '17 at 13:12
  • Broadcasting won't help you, as all it does is preemptively send the model to the worker nodes instead of sending it when necessary. I haven't tried this, but how about creating a serializable wrapper around the model (using a case class, for example) and writing your own serialization code for the GBM, using JSON for example? – Phasmid Mar 16 '17 at 13:19
  • Have you looked at this SO entry? [Can a model be created on Spark batch and use it in Spark streaming?](http://stackoverflow.com/questions/37114302/can-a-model-be-created-on-spark-batch-and-use-it-in-spark-streaming) – riccardo.cardin Mar 16 '17 at 13:29
  • @Phasmid Your idea seems very useful. Could you please give a relevant link? The only similar solution that comes to my mind is to load the model inside `foreachPartition`, but that would be too inefficient. – Dinosaurius Mar 16 '17 at 15:51
  • @riccardo.cardin: The link that you provide does not give the solution to my particular issue. I cannot call `sameModel.predict(newData)`. I need to call `predict` inside `foreachRDD` and `foreachPartition`. – Dinosaurius Mar 16 '17 at 15:53
  • My suggestion was to load the model directly into the foreachRDD. – riccardo.cardin Mar 16 '17 at 16:12
  • @riccardo.cardin: Do you mean this? `myDSTREAM.foreachRDD(rdd => { rdd.foreachPartition({ partitionOfRecords => val model = GradientBoostedTreesModel.load(sc,pathToModel) //... }) })` – Dinosaurius Mar 16 '17 at 16:14
  • Can you share your stacktrace? – Kaushal Mar 16 '17 at 16:35
  • @Dinosaurius yep, I mean that. But, looking at the documentation, I can see that [GradientBoostedTreesModel](https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/mllib/tree/model/GradientBoostedTreesModel.html) is serializable. So, that object should not be the problem. Can you share the stacktrace? Are you sure you are not using some values declared in the parent scope inside the `foreachRDD` function? – riccardo.cardin Mar 16 '17 at 21:19
  • I've got it! The problem should be `vecTest`. This MUST be an `RDD`. So, if you load the content of `vecTest` in the outer scope, you cannot use it inside the `foreachRDD` scope: `RDD`s are not serializable. – riccardo.cardin Mar 16 '17 at 21:22
  • @riccardo.cardin You are right, `vecTest` is not an RDD; it's a `Vector` created as follows: `val vecTest = Vectors.parse(input)`, where `input` is a `String`. It is executed inside `partitionOfRecords.foreach(s => { ... })`, because I need to apply some if-then rules to each record of the RDD. So, do I understand correctly that I should do something like this? `partitionOfRecords.foreach(s => { // ... val vecTestRDD = sc.parallelize(vecTest) val prediction_result = model.predict(vecTestRDD) })` – Dinosaurius Mar 17 '17 at 09:48
  • @Dinosaurius I think there is a problem. I can't look at your code, but the `SparkContext` is not serializable. So if you're using it inside the `partitionOfRecords.foreach` function, that is a problem. Can you confirm this fact? Look at this SO [Unable to serialize SparkContext in foreachRDD](http://stackoverflow.com/questions/38807066/unable-to-serialize-sparkcontext-in-foreachrdd) – riccardo.cardin Mar 17 '17 at 12:23
  • @riccardo.cardin: Please see my update with a stacktrace and the code. Yes, I confirm that I use `sc` inside `partitionOfRecords.foreach`. But I cannot find the way to change the whole logic of the code. The problem is that I need to make `model.predict(...)` inside `partitionOfRecords.foreach`. – Dinosaurius Mar 17 '17 at 12:25
  • What is `org.test.classifier.Predictor`? This is the class that is generating the problem. – riccardo.cardin Mar 17 '17 at 12:26
  • @riccardo.cardin: My code that I posted is located inside `Predictor.scala` – Dinosaurius Mar 17 '17 at 12:27
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/138326/discussion-between-riccardo-cardin-and-dinosaurius). – riccardo.cardin Mar 17 '17 at 12:27
  • @riccardo.cardin: I tested another approach (see my Update 3). The idea is to simply add a new column "prediction_result" to all entries in `myDStream` and return the new DStream for further processing. However, there is again a task serialization error. Hopefully in this case it should be easy to solve, because the code is now very simple. – Dinosaurius Mar 17 '17 at 14:07
  • Could you post the stacktrace? – riccardo.cardin Mar 17 '17 at 14:26
  • @Dinosaurius any news? – riccardo.cardin Mar 19 '17 at 10:15
  • @riccardo.cardin: I tried another approach, as posted in my new thread here: http://stackoverflow.com/questions/42858696/how-to-add-a-new-column-to-rdd-in-dstream-and-return-new-dstream However, I again get the serialization error, though now it seems to be easier to solve because the code is much shorter. Could you please take a look? – Dinosaurius Mar 19 '17 at 19:01

0 Answers