
I build my corpus from a text file. The corpus is a JavaPairRDD&lt;Long, Vector&gt; of a document ID (created with zipWithIndex()) and a count of how many times each word in the vocabulary appears in that document. When I count the documents below, I think I should be getting the same number both times.
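To illustrate, here is a minimal sketch of the per-document counting step. The `countVector` helper, the vocabulary, and the sample documents are my own assumptions, not the asker's actual code; in the real job this logic would run inside a map over `lines.zipWithIndex()` to produce the (docId, Vector) pairs.

```java
import java.util.Arrays;
import java.util.List;

public class WordCounter {
    // Count how often each vocabulary word appears in one document.
    // Words outside the vocabulary are silently ignored.
    static double[] countVector(String document, List<String> vocab) {
        double[] counts = new double[vocab.size()];
        for (String word : document.toLowerCase().split("\\s+")) {
            int idx = vocab.indexOf(word);
            if (idx >= 0) counts[idx] += 1.0;
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> vocab = Arrays.asList("data", "spark", "model");
        // A document made only of out-of-vocabulary words counts as all zeros.
        System.out.println(Arrays.toString(countVector("aardvark zyzzyva", vocab)));
        // prints [0.0, 0.0, 0.0]
        System.out.println(Arrays.toString(countVector("spark data spark", vocab)));
        // prints [1.0, 2.0, 0.0]
    }
}
```

Note that the first document yields an all-zero vector, which matters later in this question.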

    System.out.println("Corpus: " + corpus.count());

    // Cluster the documents into six topics using LDA
    DistributedLDAModel ldaModel = (DistributedLDAModel) new LDA().setK(6).run(corpus);

    System.out.println("LDA Model: " + ldaModel.topTopicsPerDocument(2).count());

When counting vocabulary per document, I only look at the most common words. It is possible for two documents to look identical on this basis, and also possible that a document containing only uncommon words ends up as all zeros.

I'm investigating whether this is what's causing the problem, but if there is a way to keep documents in either of these situations from being "pruned" or whatever is happening, that would probably solve my issue.

In the first println I get 1642012. After creating my LDA model and checking the size I only have 1582030. I'm missing 59982 documents.

What is happening to these missing documents?


1 Answer


I found my issue. My corpus was filled with documents that DID have only UNcommon words. The resulting vector of how often each word in our common-words vocab appeared looked like [0, 0, 0, 0, 0, ..., 0], and evidently such documents were removed before building the LDA model.

I could fix this by including all the words in the vocab, not just the common words, or (which is what I did) by adding a spot for uncommon words at the end, so that every document with at least one word has at least one non-zero element in its vector.
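The second fix could be sketched as follows. This is a hypothetical helper, not the answerer's actual code: it reserves one extra trailing slot for out-of-vocabulary words, so any document containing at least one word produces a non-zero vector and survives into the LDA model.

```java
import java.util.Arrays;
import java.util.List;

public class SafeWordCounter {
    // Count vocabulary words, plus one trailing bucket for everything else.
    static double[] countVector(String document, List<String> vocab) {
        double[] counts = new double[vocab.size() + 1]; // last slot = "uncommon"
        for (String word : document.toLowerCase().split("\\s+")) {
            int idx = vocab.indexOf(word);
            if (idx >= 0) counts[idx] += 1.0;
            else counts[vocab.size()] += 1.0; // out-of-vocab words land here
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> vocab = Arrays.asList("data", "spark", "model");
        // Previously all zeros; now the trailing bucket keeps the document alive.
        System.out.println(Arrays.toString(countVector("aardvark zyzzyva", vocab)));
        // prints [0.0, 0.0, 0.0, 2.0]
    }
}
```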
