I build my corpus from a text file; corpus is a JavaPairRDD<Long, Vector> of a document ID (created with zipWithIndex()) and, for each document, a count of how many times each word in the vocabulary appears. When I count the documents below, I expect to get the same number both times.
System.out.println("Corpus: " + corpus.count());
// Cluster the documents into six topics using LDA
DistributedLDAModel ldaModel = (DistributedLDAModel) new LDA().setK(6).run(corpus);
System.out.println("LDA Model: " + ldaModel.topTopicsPerDocument(2).count());
When I count the vocabulary per document, I only look at the most common words. That means two documents can produce identical count vectors, and a document containing only uncommon words would produce a vector of all zeros.
I'm investigating whether this is the cause myself, but if there is a way to keep documents in either of these situations from being "pruned" (or whatever is happening), that would probably solve my issue.
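To illustrate the second situation, here is a minimal plain-Java sketch (no Spark, just the counting logic I described) showing how a document made entirely of out-of-vocabulary words ends up as an all-zero vector, and how such documents could be filtered out before running LDA. In actual Spark code I assume something like corpus.filter(t -> t._2().numNonzeros() > 0) would do the filtering, but I haven't verified that yet:

```java
import java.util.Arrays;
import java.util.List;

public class VocabCounts {
    // Count occurrences of each vocabulary word in a document.
    // Words outside the vocabulary are simply dropped, so a document
    // made only of out-of-vocabulary words yields an all-zero vector.
    static double[] countVector(List<String> vocab, String[] docWords) {
        double[] counts = new double[vocab.size()];
        for (String w : docWords) {
            int idx = vocab.indexOf(w);
            if (idx >= 0) counts[idx] += 1.0;
        }
        return counts;
    }

    // True if the count vector has no nonzero entries,
    // i.e. the document would be a candidate for "pruning".
    static boolean allZero(double[] v) {
        for (double x : v) if (x != 0.0) return false;
        return true;
    }

    public static void main(String[] args) {
        List<String> vocab = Arrays.asList("spark", "lda", "topic");
        double[] a = countVector(vocab, "spark lda lda".split(" "));
        double[] b = countVector(vocab, "obscure jargon only".split(" "));
        System.out.println(Arrays.toString(a)); // [1.0, 2.0, 0.0]
        System.out.println(allZero(b));         // true
    }
}
```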
In the first println I get 1642012. After creating my LDA model, the second count is only 1582030, so I'm missing 59982 documents.
What is happening to these missing documents?
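To narrow this down, I was thinking of collecting the document IDs before and after and diffing them. In Spark I assume this would mean mapping both RDDs down to their Long keys and subtracting one from the other; here is just the set logic in plain Java, as a sketch:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

public class MissingDocs {
    // Given the document IDs present before and after LDA,
    // return the IDs that were dropped, in sorted order.
    static Set<Long> missingIds(Set<Long> before, Set<Long> after) {
        Set<Long> missing = new TreeSet<>(before);
        missing.removeAll(after);
        return missing;
    }

    public static void main(String[] args) {
        Set<Long> before = new HashSet<>(Arrays.asList(0L, 1L, 2L, 3L));
        Set<Long> after = new HashSet<>(Arrays.asList(0L, 2L));
        System.out.println(missingIds(before, after)); // [1, 3]
    }
}
```

Once I have the missing IDs, I could look up those documents in the original file and check whether they are the all-zero or duplicate ones.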