
I have built a machine learning model based on clustering, and now I want to update it with new data periodically (on a daily basis). I am using PySpark MLlib and cannot find any method in Spark for this.

Note that the method I need, `partial_fit`, is available in scikit-learn, but not in Spark.
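For reference, this is the kind of incremental update I mean; a minimal scikit-learn sketch (`MiniBatchKMeans` is one of the estimators that supports `partial_fit`; the data here is just random placeholder input):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

model = MiniBatchKMeans(n_clusters=3)

# Day 1: fit on the first batch of data.
model.partial_fit(np.random.rand(100, 4))

# Day 2: update the existing model with the new batch,
# without retraining on all the accumulated data.
model.partial_fit(np.random.rand(100, 4))
```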

I am not in favor of appending the new data and rebuilding the model every day, as that grows the data size and is computationally expensive.

Please suggest an effective way to update the model, or to do online learning, with Spark MLlib.

  • In the general case you cannot. Some models (especially in the old API) have methods that enable such a process, but this is the exception, not the rule, and it applies only to a small subset of iterative algorithms. There are also a few legacy streaming implementations (regression models, k-means). – zero323 Nov 19 '18 at 12:34
  • See [StreamingLinearAlgorithm](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.regression.StreamingLinearAlgorithm), [StreamingKMeans](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.clustering.StreamingKMeans) and parameters like [initialWeights in LinearRegressionWithSGD.run](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.regression.LinearRegressionWithSGD) – zero323 Nov 19 '18 at 12:43
  • @user6910411 Thanks for the comments. Could you please suggest how models are updated in industry (particularly with online learning) when dealing with massive data? – bioinformatician Nov 22 '18 at 08:41
  • 4
    I concur with @user6910411 This is not possible with Apache Spark. And for the records, sklearn or other machine learning libraries can scale with right amount of resources you don't always need Spark. – eliasah Dec 11 '18 at 14:54

1 Answer


You cannot update arbitrary models.

For a few select models this works. For some it works if you accept a loss in accuracy. But for other models, the only way is to rebuild them completely.

Take support vector machines, for example: the model stores only the support vectors. To update it, you would also need all the non-support vectors in order to find the new optimal model.

That is why it is fairly common to build new models every night, for example.
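If you do take the nightly-rebuild route in Spark MLlib, the retraining can at least be warm-started from the previous model via the `initialModel` parameter of `KMeans.train`, which usually speeds up convergence. A minimal sketch with hypothetical paths and parameters:

```python
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans, KMeansModel

sc = SparkContext(appName="nightly-kmeans-rebuild")

# Hypothetical path; one comma-separated feature vector per line.
data = sc.textFile("hdfs:///data/all.csv") \
         .map(lambda line: [float(x) for x in line.split(",")])

# Warm start: seed tonight's run with yesterday's centers (k must match).
yesterday = KMeansModel.load(sc, "hdfs:///models/kmeans-yesterday")

model = KMeans.train(data, k=10, maxIterations=20, initialModel=yesterday)
model.save(sc, "hdfs:///models/kmeans-today")
```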

Streaming is quite overrated, k-means in particular. Doing online k-means on "big" data is mostly pointless: each new point has next to zero effect on the centers, so you may just as well run a batch job every night. These streaming variants are largely academic toys with little practical relevance.
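For reference, the streaming k-means dismissed above looks roughly like this in PySpark; a minimal sketch, assuming a hypothetical watched directory and 4-dimensional features:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.mllib.clustering import StreamingKMeans

sc = SparkContext(appName="streaming-kmeans")
ssc = StreamingContext(sc, batchDuration=60)

# Hypothetical directory that new vector files land in,
# one comma-separated feature vector per line.
stream = ssc.textFileStream("hdfs:///data/incoming") \
            .map(lambda line: [float(x) for x in line.split(",")])

# Update the cluster centers continuously as micro-batches arrive.
model = StreamingKMeans(k=10, decayFactor=1.0).setRandomCenters(dim=4, weight=0.0, seed=42)
model.trainOn(stream)

ssc.start()
ssc.awaitTermination()
```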

– Has QUIT--Anony-Mousse