
Apparently, the LSHModel of MLlib from Spark 2.4 supports Spark Structured Streaming (https://issues.apache.org/jira/browse/SPARK-24465).

However, it's not clear to me how. For instance, could an approxSimilarityJoin from the MinHashLSH transformation (https://spark.apache.org/docs/latest/ml-features#lsh-operations) be applied directly to a streaming DataFrame?

I can't find more information about it online. Could someone help me?

asked by Galuoises

1 Answer


You need to

  1. Persist the trained model (e.g. modelFitted) somewhere accessible to your streaming job. This is done outside of the streaming job:
modelFitted.write.overwrite().save("/path/to/model/location")
  2. Then load this model within your Structured Streaming job:
import org.apache.spark.ml._
val model = PipelineModel.read.load("/path/to/model/location")
  3. Apply this model to your streaming DataFrame (e.g. df) with
model.transform(df)

// in your case you may work with two streaming DataFrames to apply `approxSimilarityJoin` — see the sketch below.
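To make the flow concrete, here is a minimal end-to-end sketch. It assumes a MinHashLSH model fitted on a static DataFrame with a `features` vector column, and a similarity join of the stream against a static reference set (the training data) rather than two streams. The paths, the parquet streaming source, the column names, and the 0.6 distance threshold are placeholders, and whether `approxSimilarityJoin` itself runs on a streaming DataFrame should be verified against your Spark version; this is a sketch, not a definitive implementation:

```scala
import org.apache.spark.ml.feature.{MinHashLSH, MinHashLSHModel}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("lsh-streaming-sketch").getOrCreate()

// ---- Batch job (in practice a separate application): fit the model once and persist it ----
val trainingDf = spark.createDataFrame(Seq(
  (0, Vectors.sparse(6, Seq((0, 1.0), (1, 1.0), (2, 1.0)))),
  (1, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (4, 1.0)))),
  (2, Vectors.sparse(6, Seq((0, 1.0), (2, 1.0), (4, 1.0))))
)).toDF("id", "features")

val mh = new MinHashLSH()
  .setNumHashTables(5)
  .setInputCol("features")
  .setOutputCol("hashes")

val modelFitted: MinHashLSHModel = mh.fit(trainingDf)
modelFitted.write.overwrite().save("/path/to/model/location")

// ---- Streaming job: load the fitted model and apply it ----
val model = MinHashLSHModel.load("/path/to/model/location")

// Placeholder source: a streaming DataFrame with the same schema as the training data,
// i.e. a "features" vector column in the format the model was trained on.
val streamingDf = spark.readStream
  .schema(trainingDf.schema)
  .parquet("/path/to/streaming/input")

// Plain scoring: adds the "hashes" column to each micro-batch.
val hashed = model.transform(streamingDf)

// Similarity join of the stream against a static reference DataFrame
// (here the training data); 0.6 is an arbitrary Jaccard-distance threshold.
val joined = model.approxSimilarityJoin(streamingDf, trainingDf, 0.6, "JaccardDistance")

joined.writeStream
  .format("console")
  .outputMode("append")
  .start()
  .awaitTermination()
```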

It might also be necessary to get the streaming DataFrame into the format the model expects for prediction; one way of doing that is sketched below.
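One option for that (a sketch, not the only approach) is to bundle the feature-preparation transformers together with the LSH stage in a single Pipeline at training time, so the persisted PipelineModel reproduces the same preprocessing on the stream. The column names, paths, and the `trainingTextDf` / `rawStreamingDf` DataFrames below are hypothetical:

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.feature.{HashingTF, MinHashLSH, MinHashLSHModel, RegexTokenizer}

// ---- Batch job: feature preparation and LSH fitted together ----
val tokenizer = new RegexTokenizer().setInputCol("text").setOutputCol("tokens")
val tf = new HashingTF()
  .setInputCol("tokens").setOutputCol("features")
  .setBinary(true)                       // MinHash only cares about set membership
val mh = new MinHashLSH()
  .setInputCol("features").setOutputCol("hashes")
  .setNumHashTables(5)

val pipeline = new Pipeline().setStages(Array(tokenizer, tf, mh))
val fitted   = pipeline.fit(trainingTextDf)           // trainingTextDf: hypothetical static DataFrame with a "text" column
fitted.write.overwrite().save("/path/to/pipeline/location")

// ---- Streaming job: the loaded PipelineModel applies the same preprocessing ----
val model  = PipelineModel.load("/path/to/pipeline/location")
val scored = model.transform(rawStreamingDf)          // rawStreamingDf: hypothetical streaming DataFrame with a "text" column

// If approxSimilarityJoin is needed, the fitted LSH stage can be pulled out of the PipelineModel:
val lsh = model.stages.last.asInstanceOf[MinHashLSHModel]
```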

answered by Michael Heil
  • I see. If I understand correctly, does this mean that the model has to be re-fitted on the entire data stream every time a new batch is read? – Galuoises Mar 03 '21 at 11:16
  • Moreover, it seems that for a data stream, `approxSimilarityJoin` has to be computed between each new batch and all the previous data in the stream, in a stateful way, so all data in the stream has to be saved first. Is this the case, or am I not getting something right? – Galuoises Mar 03 '21 at 12:07