
I'm using Latent Dirichlet Allocation in the Java version of Spark.

The following line works fine:

LDAModel ldaModel = new LDA()//
                        .setK( NUM_TOPICS )//
                        .setMaxIterations( MAX_ITERATIONS )//
                        .run( corpus );

And this uses (I believe) the default EM optimiser.
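
If I understand the API correctly, that is equivalent to setting the EM optimiser explicitly, roughly as follows (that EMLDAOptimizer is the default is my assumption here):

// EMLDAOptimizer lives in org.apache.spark.mllib.clustering; selecting it
// explicitly should match the default behaviour of the snippet above.
LDAModel ldaModel = new LDA()//
                        .setK( NUM_TOPICS )//
                        .setOptimizer( new EMLDAOptimizer() )//
                        .setMaxIterations( MAX_ITERATIONS )//
                        .run( corpus );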

However, when I try to use the Stochastic Variational Optimizer, as follows:

OnlineLDAOptimizer optimizer = new OnlineLDAOptimizer()//
                                   .setMiniBatchFraction( 2.0 / MAX_ITERATIONS );
LDAModel ldaModel = new LDA()//
                    .setK( NUM_TOPICS )//
                    .setOptimizer( optimizer )//
                    .setMaxIterations( MAX_ITERATIONS )//
                    .run( corpus );

I get the following:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 11.0 failed 1 times, most recent failure: Lost task 1.0 in stage 11.0 (TID 50, localhost): java.lang.IndexOutOfBoundsException: (0,2) not in [-3,3) x [-2,2)
at breeze.linalg.DenseMatrix.apply(DenseMatrix.scala:84)
at breeze.linalg.Matrix$class.apply(Matrix.scala:39)
...

Has anyone had any success getting the online optimizer to work in the Java version of Spark? As far as I can tell, that's the only difference here.

– Ben Allison

2 Answers


I had a similar problem, and it turned out that I had made a mistake when creating the SparseVectors for the corpus.

Instead of supplying the total number of terms in the corpus as the first parameter, I supplied the length of the indices and values arrays.

This led to the IndexOutOfBoundsException:

Vectors.sparse(indices.length, indices, values);

while this works for me:

Vectors.sparse(numberOfTermsInCorpus, indices, values);

The exception occurs only when using the OnlineLDAOptimizer. When using the standard EM optimiser, my mistake did not affect the creation of the model.
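
For illustration, here is a minimal Java sketch of the difference; vocabSize, indices and values are placeholder names for the vocabulary size and one document's term ids and counts:

import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

// Placeholder data: vocabSize is the number of distinct terms in the whole
// corpus, not the number of non-zero entries in this document.
int vocabSize = 10;
int[] indices = { 0, 3, 7 };
double[] values = { 2.0, 1.0, 5.0 };

// My mistake was Vectors.sparse(indices.length, indices, values): the declared
// vector size then equals the number of non-zero entries, and the
// OnlineLDAOptimizer later fails with the IndexOutOfBoundsException above.

// Correct: declare each document vector with the full vocabulary size.
Vector doc = Vectors.sparse(vocabSize, indices, values);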

– c_froehlich

I think the problem is in

.setMiniBatchFraction( 2.0 / MAX_ITERATIONS );

Have you tried

.setMiniBatchFraction(math.min(1.0, mbf))

where mbf is

val mbf = {
  // add (1.0 / actualCorpusSize) to MiniBatchFraction to be more robust on tiny datasets
  val corpusSize = corpus.count()
  2.0 / maxIterations + 1.0 / corpusSize
}
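
In Java (the question's setting) that would look roughly like this sketch, reusing corpus and MAX_ITERATIONS from the question:

long corpusSize = corpus.count();
// Cap the mini-batch fraction at 1.0 and add 1.0 / corpusSize to be more
// robust on tiny datasets, as in the Scala snippet above.
double mbf = Math.min( 1.0, 2.0 / MAX_ITERATIONS + 1.0 / corpusSize );

OnlineLDAOptimizer optimizer = new OnlineLDAOptimizer()//
                                   .setMiniBatchFraction( mbf );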
  • This does not fix the problem. What is this change intended to address (in case I'm missing something)? Also note this relates to the Java, not Scala, version of Spark. – Ben Allison Oct 22 '15 at 12:10