
I am using Latent Semantic Analysis (LSA) for text similarity. I have two questions.

  1. How do I select the value of k for dimensionality reduction?

  2. I have read in many places that LSI works for words with similar meanings, for example *car* and *automobile*. How is that possible? What is the magic step I am missing here?

user238384

2 Answers

  1. The typical choice for k is 300. Ideally, you would set k based on an evaluation metric that uses the reduced vectors. For example, if you're clustering documents, you could select the k that maximizes the clustering quality score (see the sketch after this list). If you don't have a benchmark to measure against, then set k based on the size of your data set: with only 100 documents you wouldn't expect to need several hundred latent factors to represent them, while with a million documents 300 may be too small. That said, in my experience the resulting vectors are fairly robust to large changes in k, provided k is not too small (i.e., k = 300 does about as well as k = 1000).

  2. You might be confusing LSI with Latent Semantic Analysis (LSA). They are closely related techniques, the difference being that LSI operates on documents while LSA operates on words; both use the same input, a term x document matrix. There are several good open-source LSA implementations if you would like to try them; the LSA Wikipedia page has a comprehensive list.
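
To make point 1 concrete, here is a minimal sketch of picking k empirically, assuming scikit-learn is available. The `documents` list, the candidate k values, and the cluster count are all hypothetical placeholders, not part of the original answer:

```python
# A minimal sketch: pick k by the clustering quality of the reduced vectors.
# The toy corpus and parameter choices below are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

documents = ["the car drove fast", "an automobile sped by",
             "stocks fell sharply", "the market dropped today"]

# Term x document weighting; TruncatedSVD on this matrix is the LSA/LSI step.
tfidf = TfidfVectorizer().fit_transform(documents)

best_k, best_score = None, -1.0
for k in (2, 3):  # with real data you would try e.g. 50, 100, 300, 500
    reduced = TruncatedSVD(n_components=k, random_state=0).fit_transform(tfidf)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced)
    score = silhouette_score(reduced, labels)
    if score > best_score:
        best_k, best_score = k, score

print(f"best k = {best_k} (silhouette = {best_score:.3f})")
```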

David Jurgens
  1. Try a few different values of k in [1..n] and see which works best for whatever task you are trying to accomplish.

  2. Build a word-word correlation matrix (i.e., cell (i, j) holds the number of documents in which words i and j co-occur) and use something like PCA on it; a sketch follows this list.
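
Here is a hedged sketch of that word-word recipe, assuming NumPy and scikit-learn. The toy corpus, vocabulary handling, and cosine helper are illustrative assumptions; the point is that "car" and "automobile" never co-occur directly, yet end up with similar reduced vectors because they share context words:

```python
# Sketch: word-word co-occurrence matrix + PCA on a toy corpus (hypothetical).
import numpy as np
from sklearn.decomposition import PCA

docs = [["car", "engine", "road"],
        ["automobile", "engine", "road"],
        ["stock", "market", "price"],
        ["market", "price", "trade"]]

vocab = sorted({w for d in docs for w in d})
idx = {w: i for i, w in enumerate(vocab)}

# cooc[i, j] = number of documents in which words i and j co-occur
cooc = np.zeros((len(vocab), len(vocab)))
for d in docs:
    for wi in d:
        for wj in d:
            if wi != wj:
                cooc[idx[wi], idx[wj]] += 1

vecs = PCA(n_components=2).fit_transform(cooc)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# "car" and "automobile" share the context words "engine" and "road",
# so their reduced vectors come out nearly identical.
print(cosine(vecs[idx["car"]], vecs[idx["automobile"]]))
```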

Aditya Mukherji