2

I am trying to read data from Kafka using spark structured streaming and predict form incoming data. I'm using model which I have trained using Spark ML.

val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .master("local")
  .getOrCreate()
import spark.implicits._

val toString = udf((payload: Array[Byte]) => new String(payload))
val sentenceDataFrame = spark.readStream.format("kafka").option("kafka.bootstrap.servers","localhost:9092").option("subscribe", "topicname1")
  .load().selectExpr("CAST(value AS STRING)").as[(String)]
sentenceDataFrame.printSchema()
val regexTokenizer = new RegexTokenizer()
  .setInputCol("value")
  .setOutputCol("words")
  .setPattern("\\W")
val tokencsv = regexTokenizer.transform(sentenceDataFrame)
val remover = new StopWordsRemover()
  .setInputCol("words")
  .setOutputCol("filtered")

val removestopdf = remover.transform(tokencsv)
// Learn a mapping from words to Vectors.
val word2Vec = new Word2Vec()
  .setInputCol("filtered")
  .setOutputCol("result")
  .setVectorSize(300)
  .setMinCount(0)

val model = word2Vec.fit(removestopdf)

val result = model.transform(removestopdf)


val featureIndexer = new VectorIndexer()
  .setInputCol("result")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(2)
  .fit(result)

val some = featureIndexer.transform(result)

val model1 = RandomForestClassificationModel.load("/home/akhil/Documents/traindata/stages/2_rfc_80e12c5d1259")

  val predict = model1.transform(result)

val query = predict.writeStream
  .outputMode("append")
  .format("console")
  .start()
query.awaitTermination()

When I do prediction on streaming data it gives me following error:

 Exception in thread "main" org.apache.spark.sql.AnalysisException: 
 Queries with streaming sources must be executed with 
writeStream.start();;
kafka
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:196)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:35)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:33)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:128)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:127)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:127)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:127)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:127)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:127)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForBatch(UnsupportedOperationChecker.scala:33)
at org.apache.spark.sql.execution.QueryExecution.assertSupported(QueryExecution.scala:58)
at org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:69)
at org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:67)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:73)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:73)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:79)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:75)
at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:84)
at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:84)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87)
at org.apache.spark.sql.Dataset.rdd$lzycompute(Dataset.scala:2547)
at org.apache.spark.sql.Dataset.rdd(Dataset.scala:2544)
at org.apache.spark.ml.feature.Word2Vec.fit(Word2Vec.scala:175)
at predict1model$.main(predict1model.scala:53)
at predict1model.main(predict1model.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)

error is referring to word2vec.fit(removestopdf) line . Any help would be really appreciated .

zero323
  • 322,348
  • 103
  • 959
  • 935
dv akhil
  • 93
  • 8

2 Answers2

3

In general, Structured Streaming cannot (yet - as of Spark 2.2) be used to train Spark ML models. There are some operations that are not supported in Structured Streaming. One of those is to transform a Dataset to its rdd representation. In particular the case of word2Vec, it needs to go to the rdd level to implement fit.

Nevertheless, it's possible to train the model on a static dataset and apply the predictions on the streaming data. The transform operation is usable on a streaming Dataset, like above: val result = model.transform(removestopdf)

In a nutshell, we need to fit the model on a static dataset. The resulting transformer can be applied to a streaming Dataset.

Rick Moritz
  • 1,449
  • 12
  • 25
maasg
  • 37,100
  • 11
  • 88
  • 115
  • You can call the "model.transform" method on your streaming if and only if the implementation of the transform method of your algorithm, is using operations compatible with the Spark Streaming Dataset. For example the https://github.com/titicaca/spark-iforest will not work on Spark Streaming,evnthought you have trained your model on the static dataset. When you load the model and apply it to stream it will trow Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start(). – florins Oct 28 '19 at 11:07
0

You can find a proof of concept on this Github project "Spark Structured Streaming ML"

There is also SPARK-16424 you can follow

Paul Leclercq
  • 989
  • 2
  • 15
  • 26