I am attempting to model topics using Mallet. I have repeatedly seen blog posts and research papers recommend limiting the number of words per document, in most cases to around 1000 words. It is clear, of course, that LDA requires a minimum number of words. However, is there really a technical reason to split larger documents into smaller chunks? My documents range from 5k to 20k words. Would I be better off splitting a 5k document into multiple documents?

Many thanks in advance!

Glorifier

1 Answer

There are a couple of reasons for splitting long documents into smaller chunks.

The intuitive reason is that longer documents are more likely to be generated from more topics. You can certainly set your parameters to account for this, but we know that words that appear near each other are more likely to belong to the same topic than words that appear further apart, even within the same document. We can account for this distance by splitting larger documents. Think of it as splitting a book into chapters instead of putting the entire book into the model.
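As a rough illustration of that kind of splitting, here is a minimal Python sketch that cuts a document into consecutive chunks of roughly 1000 words. The function names, the naming scheme for chunks, and the `min_words` threshold of 200 are assumptions made for this example, not anything Mallet prescribes; the 1000-word limit is just the figure mentioned in the question. How the chunks are then imported into Mallet is left out.

```python
def split_into_chunks(text, max_words=1000, min_words=200):
    """Split text into consecutive chunks of at most max_words words.

    Words are whitespace-separated tokens. If the last chunk would be
    shorter than min_words, it is appended to the previous chunk so that
    no very short document ends up in the corpus (that chunk may then
    exceed max_words). Both thresholds are illustrative assumptions.
    """
    words = text.split()
    chunks = [words[i:i + max_words] for i in range(0, len(words), max_words)]
    # Fold a very short trailing chunk into its predecessor.
    if len(chunks) > 1 and len(chunks[-1]) < min_words:
        tail = chunks.pop()
        chunks[-1].extend(tail)
    return [" ".join(chunk) for chunk in chunks]


def name_chunks(doc_id, text, max_words=1000, min_words=200):
    """Return (name, chunk_text) pairs such as ('doc1_part0', '...').

    The '<doc_id>_part<n>' naming scheme is hypothetical; it only keeps
    track of which original document each chunk came from.
    """
    chunks = split_into_chunks(text, max_words, min_words)
    return [(f"{doc_id}_part{n}", chunk) for n, chunk in enumerate(chunks)]
```

Splitting on a fixed word count is the crudest option. If the documents have natural boundaries (chapters, or time windows for a stream of comments), splitting there is closer to the book-and-chapters idea above.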

There is also a computational reason for splitting documents into smaller chunks. It has to do with the relative computational cost of generating one longer document versus a few shorter ones, and of approximating the topics for a longer document versus a shorter one. I don't remember the math off the top of my head, but it's generally faster to run a model on 1,000,000 documents of 100 words each than on 100,000 documents of 1,000 words each.

Dharman
rchurch4
    Many thanks for your comment! I certainly get your point regarding the correlation between document length and number of topics. My database consists entirely of Facebook comments, but I could reduce the time frame covered by each document (fewer comments = shorter document) and thereby limit the document size. I will have to see where I end up topic-wise anyway. I'll start with 10 topics and then see which number gives the best results. Cheers – Glorifier Mar 14 '21 at 22:30