
I'm using Latent Dirichlet Allocation in the Java version of Spark.

The following line works fine:

LDAModel ldaModel = new LDA()//
                        .setK( NUM_TOPICS )//
                        .setMaxIterations( MAX_ITERATIONS )//
                        .run( corpus );

And this uses (I believe) the default EM optimiser.
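
If I understand the API correctly, that is equivalent to setting the EM optimiser explicitly, roughly as follows (that EMLDAOptimizer is the default is my assumption here):

// EMLDAOptimizer lives in org.apache.spark.mllib.clustering; selecting it
// explicitly should match the default behaviour of the snippet above.
LDAModel ldaModel = new LDA()//
                        .setK( NUM_TOPICS )//
                        .setOptimizer( new EMLDAOptimizer() )//
                        .setMaxIterations( MAX_ITERATIONS )//
                        .run( corpus );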

However, when I try to use the Stochastic Variational Optimizer, as follows:

OnlineLDAOptimizer optimizer = new OnlineLDAOptimizer()//
                                   .setMiniBatchFraction( 2.0 / MAX_ITERATIONS );
LDAModel ldaModel = new LDA()//
                    .setK( NUM_TOPICS )//
                    .setOptimizer( optimizer )//
                    .setMaxIterations( MAX_ITERATIONS )//
                    .run( corpus );

I get the following:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 11.0 failed 1 times, most recent failure: Lost task 1.0 in stage 11.0 (TID 50, localhost): java.lang.IndexOutOfBoundsException: (0,2) not in [-3,3) x [-2,2)
at breeze.linalg.DenseMatrix.apply(DenseMatrix.scala:84)
at breeze.linalg.Matrix$class.apply(Matrix.scala:39)
...

Has anyone had any success getting the online optimizer to work in the Java version of Spark? As far as I can tell, that's the only difference here.

– Ben Allison

2 Answers


I had a similar problem, and it turned out that I had made a mistake when creating the SparseVectors for the corpus.

Instead of supplying the total number of terms in the corpus as the first parameter, I supplied the length of the indices and values arrays.

This led to the IndexOutOfBoundsException:

Vectors.sparse(indices.length, indices, values);

while this works for me:

Vectors.sparse(numberOfTermsInCorpus, indices, values);

The exception occurs only when using the OnlineLDAOptimizer. When using the standard EM optimiser, my mistake did not affect the creation of the model.
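
For illustration, here is a minimal Java sketch of the difference; vocabSize, indices and values are placeholder names for the vocabulary size and one document's term ids and counts:

import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

// Placeholder data: vocabSize is the number of distinct terms in the whole
// corpus, not the number of non-zero entries in this document.
int vocabSize = 10;
int[] indices = { 0, 3, 7 };
double[] values = { 2.0, 1.0, 5.0 };

// My mistake was Vectors.sparse(indices.length, indices, values): the declared
// vector size then equals the number of non-zero entries, and the
// OnlineLDAOptimizer later fails with the IndexOutOfBoundsException above.

// Correct: declare each document vector with the full vocabulary size.
Vector doc = Vectors.sparse(vocabSize, indices, values);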

– c_froehlich

I think the problem is in

.setMiniBatchFraction( 2.0 / MAX_ITERATIONS );

Have you tried

.setMiniBatchFraction(math.min(1.0, mbf))

where mbf is

val mbf = {
  // add (1.0 / actualCorpusSize) to MiniBatchFraction to be more robust on tiny datasets
  val corpusSize = corpus.count()
  2.0 / maxIterations + 1.0 / corpusSize
}
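
In Java (the question's setting) that would look roughly like this sketch, reusing corpus and MAX_ITERATIONS from the question:

long corpusSize = corpus.count();
// Cap the mini-batch fraction at 1.0 and add 1.0 / corpusSize to be more
// robust on tiny datasets, as in the Scala snippet above.
double mbf = Math.min( 1.0, 2.0 / MAX_ITERATIONS + 1.0 / corpusSize );

OnlineLDAOptimizer optimizer = new OnlineLDAOptimizer()//
                                   .setMiniBatchFraction( mbf );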
  • This does not fix the problem. What is this change intended to address (in case I'm missing something)? Also note this relates to the Java, not Scala, version of Spark. – Ben Allison Oct 22 '15 at 12:10