In Spark 2.0.1 (pyspark), I want to fit an LDA model with the online optimizer. Does this optimizer make it possible to update the model periodically, for example once a day? I'm not sure I understand what "online" means here or what it implies. Does it mean that:

A) I have to load the entire corpus, and the model learns from it in mini-batches (and may therefore be faster than its EM counterpart).

B) I can submit a fraction of the corpus to the learner to get a first model, then submit another fraction and get an updated version of that first model.

Thanks for clarifying

EDIT: to be specific, what I do is:

from pyspark.ml.clustering import LDA

# nclusters is the number of topics; mydf is a DataFrame with "id" and "features" columns
lda = LDA(k=nclusters, seed=1, optimizer="online")
ldaModel = lda.fit(mydf.select([mydf["id"], mydf["features"]]))

With my ldaModel fitted, can I update it with a new DataFrame? In my opinion it should be possible, since the online optimizer essentially does that already: at each iteration it samples the corpus and updates the model against that subset, doesn't it?
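To make the mini-batch idea above concrete, here is a minimal pure-Python sketch of it. It is not Spark's implementation: it assumes the step-size schedule from the online variational Bayes formulation of LDA, rho_t = (offset + t)^(-decay), and it replaces the real per-batch topic estimate with a toy running mean. The argument names mirror the pyspark LDA parameters `learningOffset`, `learningDecay` and `subsamplingRate`; the function names (`step_size`, `sample_minibatch`, `online_update`, `run_toy`) are hypothetical.

```python
import random


def step_size(t, learning_offset=1024.0, learning_decay=0.51):
    """Weight given to the mini-batch estimate at iteration t (t >= 0).

    Assumed schedule: (learning_offset + t) ** (-learning_decay).
    """
    return (learning_offset + t) ** (-learning_decay)


def sample_minibatch(items, subsampling_rate, rng):
    """Keep each item independently with probability subsampling_rate.

    Bernoulli sampling is a simplification chosen for this sketch.
    """
    return [x for x in items if rng.random() < subsampling_rate]


def online_update(current, batch_estimate, rho):
    """Blend the current parameters with the estimate from one mini-batch."""
    return [(1.0 - rho) * c + rho * b for c, b in zip(current, batch_estimate)]


def run_toy(values, n_iter, subsampling_rate=0.05, initial=0.0, seed=1):
    """Toy stand-in for training: track the mean of `values` from mini-batches.

    The whole collection is available up front (interpretation A); only a
    sampled subset is used at each iteration.
    """
    rng = random.Random(seed)
    params = [initial]
    for t in range(n_iter):
        batch = sample_minibatch(values, subsampling_rate, rng)
        if not batch:
            continue  # nothing sampled this round, keep the current parameters
        batch_estimate = [sum(batch) / len(batch)]
        params = online_update(params, batch_estimate, step_size(t))
    return params[0]
```

In this sketch the sampling happens inside a single training run over a corpus that is fully available from the start, which is interpretation A. Whether a fitted ldaModel can be used as the starting point for a later run on new data (interpretation B) is a separate matter that the sketch does not settle.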

Patrick
  • I just don't understand why some people are voting to close this question, and I would like to know. Is it because this is not the right place to ask? Or because the question is not well formulated (English is not my first language)? Or because it's too silly? – Patrick Feb 09 '17 at 14:54
  • It's because this question is not a programming question with a specific example, and without that it is out of scope for SO. – mtoto Feb 10 '17 at 10:30
  • Ok. Question edited, thanks @mtoto. – Patrick Feb 10 '17 at 19:38

0 Answers