
I'm using Gensim's excellent library to compute similarity queries on a corpus using LSI. However, I have a distinct feeling that the results could be better, and I'm trying to figure out whether I can adjust the corpus itself in order to improve the results.

I have a certain amount of control over how to split the documents. My original data contains many very short documents (the mean length is 12 words, but some documents are only 1-2 words long...), and there are a few logical ways to concatenate several documents into one. The problem is that I don't know whether doing this is worth it (and if so, to what extent). I can't find any material addressing this question, only material on the size of the corpus and the size of the vocabulary. I assume this is because, at the end of the day, the size of a document is bounded by the size of the vocabulary. But I'm sure there are still some general guidelines that could help with this decision.

What is considered a document that is too short? What is too long? (I assume the latter is a function of |V|, but the former could easily be a constant value.)

Does anyone have experience with this? Can anyone point me in the direction of any papers/blog posts/research that address this question? Much appreciated!

Edited to add: Regarding the strategy for grouping documents - each document is a text message sent between two parties. The potential grouping is based on this, where I can also take into consideration the time at which the messages were sent. Meaning, I could group all the messages sent between A and B within a certain hour, or on a certain day, or simply group all the messages between the two. I can also decide on a minimum or maximum number of messages grouped together, but that is exactly what my question is about - how do I know what the ideal length is?
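Since the grouping heuristics above are purely structural, they are easy to prototype. Below is a minimal, hypothetical sketch (the message fields `sender`, `recipient`, `timestamp`, and `text` are assumptions, not an actual schema) of bucketing messages between the same two parties by hour, by day, or all together:

```python
from collections import defaultdict
from datetime import datetime

def group_messages(messages, granularity="day"):
    """Group messages by (unordered participant pair, time bucket).

    `messages` is a list of dicts with hypothetical keys
    'sender', 'recipient', 'timestamp' (a datetime), and 'text'.
    granularity: 'hour', 'day', or 'all' (one document per pair).
    """
    groups = defaultdict(list)
    for m in messages:
        # frozenset makes the pair direction-insensitive (A->B == B->A).
        pair = frozenset((m["sender"], m["recipient"]))
        if granularity == "hour":
            bucket = m["timestamp"].strftime("%Y-%m-%d %H")
        elif granularity == "day":
            bucket = m["timestamp"].strftime("%Y-%m-%d")
        else:  # 'all': ignore time entirely
            bucket = None
        groups[(pair, bucket)].append(m["text"])
    # Each concatenated group becomes one "document" for the corpus.
    return [" ".join(texts) for texts in groups.values()]
```

Varying the `granularity` argument produces the candidate corpora described above, which can then be compared on the same downstream similarity task.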

faerubin

2 Answers


Looking at the number of words per document does not seem to me to be the correct approach. LSI/LSA is all about capturing the underlying semantics of the documents by detecting common co-occurrences.

You may want to read:

  1. LSI: Probabilistic Analysis
  2. Latent Semantic Analysis (particularly section 3.2)

A relevant excerpt from 2:

An important feature of LSI is that it makes no assumptions about a particular generative model behind the data. Whether the distribution of terms in the corpus is “Gaussian”, Poisson, or some other has no bearing on the effectiveness of this technique, at least with respect to its mathematical underpinnings. Thus, it is incorrect to say that use of LSI requires assuming that the attribute values are normally distributed.

What I would be more concerned about is whether the short documents share similar co-occurring terms that allow LSI to form an appropriate topic grouping all of the documents that, to a human, share the same subject. This can hardly be done automatically (maybe with WordNet / an ontology) by substituting rare terms with more frequent and general ones. But that is a very long shot requiring further research.

A more specific answer on the heuristic:
My best bet would be to treat conversations as your documents, so the grouping would be based on the time proximity of the exchanged messages. Anything up to a few minutes apart (a quarter of an hour?) I would group together. There may be false positives, though (strongly depending on the actual contents of your dataset). As with any hyper-parameter in NLP, your mileage will vary... so it is worth doing a few experiments.
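For what it's worth, that time-proximity grouping is easy to prototype. A minimal sketch, assuming messages arrive as (timestamp, text) pairs and using the 15-minute gap suggested above as the conversation boundary:

```python
from datetime import datetime, timedelta

def split_conversations(messages, max_gap=timedelta(minutes=15)):
    """Split (timestamp, text) pairs into conversations: a new
    conversation starts whenever the gap between consecutive
    messages exceeds `max_gap`."""
    conversations = []
    current = []
    last_ts = None
    for ts, text in sorted(messages):  # sort by timestamp
        if last_ts is not None and ts - last_ts > max_gap:
            conversations.append(" ".join(current))
            current = []
        current.append(text)
        last_ts = ts
    if current:
        conversations.append(" ".join(current))
    return conversations
```

The `max_gap` threshold is exactly the hyper-parameter in question, so it is worth sweeping over a few values and comparing the resulting corpora.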

sophros
  • I haven't quite had time to read your links yet, but you might want to note that they're identical. Did you mean to put two different links there? – faerubin Aug 08 '17 at 16:20
  • Your claim about detecting common co-occurrences doesn't contradict the question of length. For example, if all the documents are of length 2, there isn't much meaning to co-occurrence. Similarly, if the documents are very long, too many words co-occur with each other, making high co-occurrence less significant. – faerubin Aug 08 '17 at 16:26
  • Granted. My point was that what seems to matter with LSI is not the distribution of document lengths but meaningful (distinctive for the topics found) co-occurrences in the documents, irrespective of their length. Your edge case of a document with length=2 is valid, but I would hope it is rare in your document set. Otherwise, using LSI does not seem to make much sense (if most of your documents have 2 words in them), and other approaches would seem more suitable to me. – sophros Aug 08 '17 at 16:34
  • Maybe getting into the variance of document length in my corpus was a mistake. It's not pertinent to the question. The point was that I have very short documents, and the question is whether I should adjust the corpus so the documents are larger - and if so, to what extent. I'll edit the question to clarify that point. – faerubin Aug 08 '17 at 16:36
  • OK, good idea to rephrase. Touching on the adjustments to the corpus - what would be the criterion to glue documents together? It appears to me that you are intending to use an automated topic modelling approach (LSI) while doing some of the work manually (grouping the documents with respect to topics). Hm... Would that help with LSI? Possibly. Depending on the number of topics you will want from LSI, the co-occurrences may be overly spread out across many short documents to form meaningful topics. You would need at least a few lengthier ones which share the terms with short documents. – sophros Aug 08 '17 at 16:43
  • The grouping is based on heuristics completely separate from content. – faerubin Aug 08 '17 at 16:51
  • Regarding the length - are you saying that high variance of document length is a good thing? – faerubin Aug 08 '17 at 16:52
  • The question remains: how strongly does the heuristic correlate with the semantics of the documents? Without more details it is hard to tell. Given that this all depends on many factors, I would simply try it on a sample from the corpus, both with and without grouping. There may be some peculiarities about your heuristic and/or dataset. – sophros Aug 08 '17 at 16:54
  • I am not saying high variance of length is a good thing. I am saying that too many extremely short documents (say, < 10 words) from diverse topics will negatively affect LSI results as long as most of the terms pertaining to one topic do not occur in larger documents (AFAIK, only in this way can short documents be attributed to a common topic). Otherwise you end up with as many topics as you have documents (no co-occurrences that would steer LSI toward a common topic) or topics with only irrelevant terms (frequent, so shared between these short docs, and hence useless). – sophros Aug 08 '17 at 17:02
  • You make a fair point. I edited the question to explain what the grouping strategies are. – faerubin Aug 08 '17 at 17:31
  • If you found this answer satisfactory please mark it as such. – sophros Aug 08 '17 at 22:03
  • While I appreciate your help in making the required clarifications to my question, your answer doesn't really add anything beyond the thoughts I already articulated in the original question. I'm looking for material with more conclusive guidelines regarding document length. Obviously, that might not exist. But for now, I'm not accepting your answer. I will upvote it, though, because I can see the thought and effort you put into it. :) – faerubin Aug 09 '17 at 08:33

Short documents are indeed a challenge when it comes to applying LDA, since the estimates for the word co-occurrence statistics are significantly worse for short documents (sparse data). One way to alleviate this issue is, as you mentioned, to somehow aggregate multiple short texts into one longer document by some heuristic measure.

One particularly nice test case for this situation is topic modeling Twitter data, since tweets are by definition limited to 140 characters. In Empirical Study of Topic Modeling in Twitter (Hong et al., 2010), the authors argue that

Training a standard topic model on aggregated user messages leads to a faster training process and better quality.

However, they also mention that different aggregation methods lead to different results:

Topics learned by using different aggregation strategies of the data are substantially different from each other.

My recommendations:

  1. If you are using your own heuristic for aggregating short messages into longer documents, make sure to experiment with different aggregation techniques (potentially all the sensible ones)

  2. Consider using a "heuristic-free" LDA variant that is better tailored to short messages, e.g., Unsupervised Topic Modeling for Short Texts Using Distributed Representations of Words
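To make recommendation 1 concrete: before fitting any model, it can help to compare simple corpus statistics across aggregation strategies, since document count and mean length shift dramatically between them. A minimal, hypothetical sketch:

```python
from statistics import mean

def corpus_stats(docs):
    """Document count and mean token length of a corpus of strings,
    for eyeballing the effect of an aggregation strategy."""
    lengths = [len(d.split()) for d in docs]
    return {"n_docs": len(docs), "mean_len": mean(lengths)}

# Example: raw short messages vs. one aggregated document.
raw = ["hi", "are you there", "yes"]
aggregated = [" ".join(raw)]
```

Running `corpus_stats` on each candidate corpus is a quick sanity check that an aggregation actually produced fewer, longer documents before spending time on training and evaluation.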

galoosh33
  • I'm talking about LSI and not LDA, but at first glance, I think everything you're saying is relevant to LSI as well. In any case, I look forward to reading the article. Hopefully it'll send me in the right direction. – faerubin Aug 09 '17 at 12:48