I'm trying to cluster similar documents using the R language. As a first step, I compute the term-document matrix for my set of documents. Then I create the latent semantic space from that term-document matrix. I decided to use LSA in my experiment because the results of clustering on the term-document matrix alone were awful. Is it possible to build a dissimilarity matrix (with the cosine measure) from the LSA space I created? I need this because the clustering algorithm I'm using requires a dissimilarity matrix as input.

Here is my code:

require(cluster);
require(lsa);

myMatrix = textmatrix("/home/user/DocmentsDirectory");
myLSAspace = lsa(myMatrix, dims=dimcalc_share());

I need to build a dissimilarity matrix (using the cosine measure) from the LSA space, so that I can call the clustering algorithm as follows:

clusters = pam(dissimilarityMatrix,10,diss=TRUE);

Any suggestions?

Thanks in advance!

lucasbls1

2 Answers

To compare two documents in the LSA space, multiply the diagonal matrix of singular values ($sk) by the transpose of the document matrix ($dk) that lsa() returns. This gives every document as a column vector in the lower-dimensional LSA space. Here's what I did:

lsaSpace <- lsa(termDocMatrix)

# lsaMatrix now is a k x (num doc) matrix, in k-dimensional LSA space
lsaMatrix <- diag(lsaSpace$sk) %*% t(lsaSpace$dk)

# Use the `cosine` function in `lsa` package to get cosine similarities matrix
# (subtract from 1 to get dissimilarity matrix)
distMatrix <- 1 - cosine(lsaMatrix)

See http://en.wikipedia.org/wiki/Latent_semantic_analysis, where it says you can now use LSA results to "see how related documents j and q are in the low dimensional space by comparing the vectors sk*d_j and sk*d_q (typically by cosine similarity)."
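
To tie this back to the pam() call in the question, here is a minimal end-to-end sketch. It makes several assumptions: the toy term-document matrix is hypothetical and stands in for the output of textmatrix(), dims = 2 and k = 2 are picked only to suit the toy data, and the pmax() clamp is my own precaution against tiny negative values from floating-point rounding.

```r
library(lsa)      # lsa(), cosine()
library(cluster)  # pam()

# Hypothetical toy term-document matrix (rows = terms, columns = documents),
# standing in for the result of textmatrix() on a real corpus.
termDocMatrix <- matrix(
  c(2, 1, 3, 0, 0, 0,
    1, 2, 1, 0, 0, 1,
    3, 1, 2, 0, 1, 0,
    0, 0, 0, 2, 3, 1,
    0, 1, 0, 1, 2, 3,
    0, 0, 1, 3, 1, 2),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste0("term", 1:6), paste0("doc", 1:6))
)

# Build the LSA space; dims = 2 is an illustrative choice for this toy data.
lsaSpace <- lsa(termDocMatrix, dims = 2)

# k x (num doc) matrix: each column is one document in LSA space.
lsaMatrix <- diag(lsaSpace$sk) %*% t(lsaSpace$dk)
colnames(lsaMatrix) <- colnames(termDocMatrix)

# cosine() compares the columns of a matrix, so this is document vs. document.
distMatrix <- 1 - cosine(lsaMatrix)

# Assumption: clamp rounding noise so no dissimilarity is below zero.
distMatrix <- pmax(distMatrix, 0)

# Cluster on the dissimilarities; k = 2 is illustrative only.
clusters <- pam(as.dist(distMatrix), k = 2, diss = TRUE)
print(clusters$clustering)
```

With a real corpus you would replace the toy matrix with your own term-document matrix and choose dims and k to suit your data.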

brian.keng

You can use the arules package. Here is an example:

 library(arules)
 dissimilarity(x=matrix(seq(1,10),ncol=2),method='cosine')
          1         2         3         4
2 -4.543479                              
3 -4.811989 -5.231234                    
4 -5.080052 -5.563952 -6.024433          
5 -5.343350 -5.885304 -6.395740 -6.877264
agstudy
  • My main problem is that I need to calculate the dissimilarity matrix using the LSA space I created. Do you know how to do that? – lucasbls1 Mar 05 '13 at 17:21