I am currently analyzing two datasets. Dataset A has 600,000+ documents, whereas Dataset B has 7,000+ documents. Does this mean the topic outputs will be mostly about Dataset A because it has a larger N? The MALLET output in RapidMiner still shows which documents fall under each topic. Is there a way to give the two datasets equal weight?
1 Answer
I am assuming you are mixing the two datasets into a single training corpus and training one model on it. Under that assumption, it is very likely that the topics will mostly reflect Dataset A rather than B, because Gibbs sampling builds topics from token co-occurrences, and most of those co-occurrences come from A. That said, some topics may still be shared across the two datasets where their content overlaps.
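To put a rough number on the imbalance, here is a quick calculation using the approximate counts from the question. Note that it measures the share of documents, not tokens; treating that as a proxy for influence on the topics assumes documents in A and B are of similar length, which I have not verified.

```python
# Approximate document counts from the question (the real counts are "600000+" and "7000+").
n_a, n_b = 600_000, 7_000

# Share of the combined corpus contributed by each dataset.
share_a = n_a / (n_a + n_b)
share_b = n_b / (n_a + n_b)

print(f"A: {share_a:.1%}, B: {share_b:.1%}")  # A: 98.8%, B: 1.2%
```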
You can downsample Dataset A so that it has the same number of documents as B, assuming their topic structures are not that different. Or you can inspect the file written by the --output-state option to see exactly which topic (z) was assigned to each token.
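Both suggestions can be sketched in Python. This is only a sketch under a few assumptions: the state file is the gzipped, whitespace-separated output with columns `doc source pos typeindex type topic` and `#`-prefixed header lines, the source names contain no spaces, and the source column lets you tell A from B. The helper names (`downsample`, `topic_counts_by_dataset`, `topic_shares`, and the `dataset_of` callback) are my own, not part of MALLET or RapidMiner.

```python
import random
from collections import Counter, defaultdict


def downsample(docs_a, n, seed=0):
    """Draw n documents from docs_a uniformly, without replacement."""
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    return rng.sample(list(docs_a), n)


def topic_counts_by_dataset(state_lines, dataset_of):
    """Count token-level topic assignments (z) per dataset.

    state_lines: iterable of lines from the decompressed state file.
    dataset_of:  hypothetical callback mapping the source column to a label.
    """
    counts = defaultdict(Counter)
    for line in state_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header lines
        fields = line.split()
        source = fields[1]       # second column: document source
        topic = int(fields[-1])  # last column: assigned topic
        counts[dataset_of(source)][topic] += 1
    return counts


def topic_shares(counts):
    """Turn raw counts into per-dataset topic proportions."""
    shares = {}
    for dataset, topic_counter in counts.items():
        total = sum(topic_counter.values())
        shares[dataset] = {t: n / total for t, n in topic_counter.items()}
    return shares
```

To use it on a real run, open the state file with `gzip.open(path, "rt", encoding="utf-8")` and pass the file handle as `state_lines`, along with a `dataset_of` that matches how your documents are named. Comparing the proportions rather than the raw counts shows how each dataset spreads over the topics regardless of its size.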

Agung Dewandaru