I am currently using ParallelTopicModel for topic modeling and have run into some strange behavior. When I set a different number of threads for the model, I get different results, which should not happen if I understand it correctly. Our implementation runs on different machines with different maximum thread counts, and the results differ between them even though the random seed, documents, number of iterations, etc. are the same.

Is this a known bug or expected behavior? Or am I just doing something wrong?

Code Snippet:

    // Begin by importing documents from text to feature sequences
    final InstanceList instances = new InstanceList(docPipe);
    instances.addThruPipe(docsIter);
    final ParallelTopicModel model =
        new ParallelTopicModel(noOfTopics, m_alpha.getDoubleValue() * noOfTopics, m_beta.getDoubleValue());
    model.setRandomSeed(m_seed.getIntValue());
    model.addInstances(instances);
    model.setNumThreads(noOfThreads);
    model.setNumIterations(noOfIterations);
    try {
        model.estimate();
    } catch (RuntimeException e) {
        throw e;
    }
bunzJ

1 Answer

Each thread has its own random number generator. Setting the seed initializes each of these to the same sequence, so if you have the same number of threads you should get the same results. Each thread is responsible for its own segment of the collection.

If you have a different number of threads, the same random numbers are being applied to different tokens, which have different sampling distributions, and so will have different sampling outcomes.

Keeping a single random number generator would add a synchronization dependency, and would not guarantee identical results unless the threads are exactly synchronized.
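
The effect can be imitated without the library. The following is a minimal sketch, not the actual ParallelTopicModel code: the class `SeedPerThreadDemo` and its `sample` method are hypothetical names, the workers run one after another instead of in real threads, and topics are drawn uniformly rather than from a per-token distribution. It only assumes what is described above: every worker gets its own generator started from the same seed, and every worker handles its own contiguous segment of the tokens.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical demo class, not part of any topic modeling library.
class SeedPerThreadDemo {

    // Assigns a "topic" to every token. The tokens are split into numWorkers
    // contiguous segments, and each segment is sampled with its own generator.
    // All generators start from the same seed.
    static int[] sample(int numTokens, int numTopics, int numWorkers, long seed) {
        int[] topics = new int[numTokens];
        int segment = (numTokens + numWorkers - 1) / numWorkers; // ceiling division
        for (int w = 0; w < numWorkers; w++) {
            Random random = new Random(seed); // same seed for every worker
            int start = w * segment;
            int end = Math.min(numTokens, start + segment);
            for (int i = start; i < end; i++) {
                topics[i] = random.nextInt(numTopics);
            }
        }
        return topics;
    }

    public static void main(String[] args) {
        int[] twoWorkers = sample(1000, 10, 2, 42L);
        int[] twoWorkersAgain = sample(1000, 10, 2, 42L);
        int[] fourWorkers = sample(1000, 10, 4, 42L);

        System.out.println("same worker count, same result: "
                + Arrays.equals(twoWorkers, twoWorkersAgain));
        System.out.println("different worker count, same result: "
                + Arrays.equals(twoWorkers, fourWorkers));
    }
}
```

With two workers, token 250 receives the 251st number of the first worker's sequence. With four workers it is the first token of the second segment and receives the first number of that sequence. The seed is identical in both runs, but the mapping from random numbers to tokens is not, so the assignments diverge.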

David Mimno
  • Okay, thank you for your comment. So the problem for me is that, because of the different numbers of threads, I cannot reproduce results on different machines unless I set the lowest common number of threads that is possible on each machine, right? – bunzJ Jan 05 '18 at 09:58
  • That's correct. Keep in mind that reproducibility given a random seed should only be used to establish that there are no inconsistencies in your pipeline. It does not suggest that the reproduced model is optimal, more reliable, or more valid. – David Mimno Jan 05 '18 at 14:17