I need to use a previously trained machine learning model to make predictions. However, the predictions must happen inside foreachRDD, because the input vector vecTest is built from each record through several transformations and if-then rules. To avoid the serialization issue, I tried broadcasting the model. My code is given below, yet I still get the serialization error. Any help is highly appreciated.
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel

// Load the model once on the driver, then broadcast it to the executors
val model = GradientBoostedTreesModel.load(sc, pathToModel)
val model_sc = sc.broadcast(model)

myDSTREAM.foreachRDD(rdd => {
  rdd.foreachPartition(partitionOfRecords => {
    // ... vecTest is built here from the records via transformations and if-then rules ...
    val prediction_result = model_sc.value.predict(vecTest)
  })
})
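To make the question self-contained, here is a minimal sketch of the pattern I am aiming for. PredictorHelpers.toFeatureVector is a hypothetical placeholder for my real transformation and if-then logic, and the records are assumed to arrive as parseable vector strings:

import org.apache.spark.mllib.linalg.{Vector, Vectors}

object PredictorHelpers {
  // Hypothetical helper standing in for the real feature-building logic
  def toFeatureVector(record: String): Vector =
    Vectors.parse(record) // assumption: records arrive as "[f1,f2,...]"
}

myDSTREAM.foreachRDD(rdd => {
  rdd.foreachPartition(partitionOfRecords => {
    partitionOfRecords.foreach(record => {
      val vecTest = PredictorHelpers.toFeatureVector(record)
      // predict(Vector) runs locally on the executor, so no SparkContext is needed here
      val prediction_result = model_sc.value.predict(vecTest)
    })
  })
})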
UPDATE:

I tried using Kryo serialization, but still without success:

val conf = new SparkConf().setMaster(...).setAppName(...)
// Register the model class with Kryo
conf.registerKryoClasses(Array(classOf[GradientBoostedTreesModel]))
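For completeness, this is roughly how the conf is wired into the streaming context in my app (a sketch; the master, app name, and batch interval are placeholders, not my real settings):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel

val conf = new SparkConf()
  .setMaster("local[2]")   // placeholder
  .setAppName("Predictor") // placeholder
conf.registerKryoClasses(Array(classOf[GradientBoostedTreesModel]))

// The streaming context is created from the Kryo-enabled conf;
// sc is then the underlying SparkContext used to load the model
val ssc = new StreamingContext(conf, Seconds(10)) // placeholder batch interval
val sc = ssc.sparkContext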
UPDATE:

If I instead load the model inside foreachPartition, as in the code below, I get a Task not serializable error (full stack trace follows):
myDSTREAM.foreachRDD(rdd => {
  rdd.foreachPartition(partitionOfRecords => {
    // Attempt to load the model on the executor, once per partition
    val model = GradientBoostedTreesModel.load(sc, pathToModel)
    partitionOfRecords.foreach(s => {
      // ... vecTest is built from the record s ...
      val vecTestRDD = sc.parallelize(Seq(vecTest))
      val prediction_result = model.predict(vecTestRDD)
    })
  })
})
17/03/17 13:11:00 ERROR JobScheduler: Error running job streaming job 1489752660000 ms.0
org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:919)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:918)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:918)
    at org.test.classifier.Predictor$$anonfun$run$2.apply(Predictor.scala:210)
    at org.test.classifier.Predictor$$anonfun$run$2.apply(Predictor.scala:209)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:661)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:661)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:50)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
    at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:49)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
    at scala.util.Try$.apply(Try.scala:161)
    at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:224)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:224)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:224)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:223)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.NotSerializableException: org.test.classifier.Predictor
Serialization stack:
    - object not serializable (class: org.test.classifier.Predictor, value: org.test.classifier.Predictor@26e949f7)
    - field (class: org.test.classifier.Predictor$$anonfun$run$2, name: $outer, type: class org.test.classifier.Predictor)
    - object (class org.test.classifier.Predictor$$anonfun$run$2, <function1>)
    - field (class: org.test.classifier.Predictor$$anonfun$run$2$$anonfun$apply$4, name: $outer, type: class org.test.classifier.Predictor$$anonfun$run$2)
    - object (class org.test.classifier.Predictor$$anonfun$run$2$$anonfun$apply$4, <function1>)
    at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
UPDATE 3:

Judging from the Serialization stack above, the closure seems to capture my outer Predictor class via its $outer field. I tried another approach, mapping over the DStream directly, but I hit the same problem:
val model = GradientBoostedTreesModel.load(sc, mySet.value("modelAddress") + mySet.value("modelId"))

val new_dstream = myDStream.map(session => {
  val features: Array[String] = UtilsPredictor.getFeatures()
  val parsedSession = UtilsPredictor.parseJSON(session)
  // Join the feature values into "[f1,f2,...,fn]" so the string can be parsed as a vector
  val input = features.map(parsedSession(_)).mkString("[", ",", "]")
  val vecTest = Vectors.parse(input)
  parsedSession + ("prediction_result" -> model.predict(vecTest).toString)
})
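To illustrate the input format Vectors.parse expects here, a toy example (the feature names and values are made up, not from my data):

import org.apache.spark.mllib.linalg.Vectors

val parsed = Map("f1" -> "0.5", "f2" -> "1.0", "f3" -> "2.5")
val input = Array("f1", "f2", "f3").map(parsed(_)).mkString("[", ",", "]")
// input == "[0.5,1.0,2.5]"
val vec = Vectors.parse(input) // dense vector of the three feature values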