Document similarity using LSA in R

Question

I am working on LSA (latent semantic analysis) in R for document similarity analysis. Here are my steps:

Imported the text data and created a corpus. Did basic corpus operations such as stemming, whitespace removal, etc.
Created the LSA space as below

tdm <- TermDocumentMatrix(chat_corpus)
tdm_matrix <- as.matrix(tdm)
tdm.lsa <- lw_bintf(tdm_matrix) * gw_idf(tdm_matrix)
lsaSpace <- lsa(tdm.lsa)
Multidimensional scaling (MDS) on the LSA space

dist.mat.lsa <- dist(t(as.textmatrix(lsaSpace)))
fit <- cmdscale(dist.mat.lsa, eig = TRUE)
points <- data.frame(fit$points, row.names = chat$text)

I want to create a matrix/data frame showing how similar the texts are (as shown in the attached expected result). The rows and columns will be the texts to match, and the cell values will be their similarity. Ideally, the diagonal values will be 1 (perfect match), while the other cell values will be less than 1.

Please share some insight into how to do this. Thanks in advance.

Note: I have Python code for this but need the same in R:

import numpy as np
import pandas as pd
similarity = np.asarray(np.asmatrix(dtm_lsa) * np.asmatrix(dtm_lsa).T)
pd.DataFrame(similarity, index=example, columns=example).head(10)

Expected Result

score 0 · Answer 1 · answered Jul 26 '18 at 10:48

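To do this, first take the S_k and D_k matrices from the LSA space you've created and multiply S_k by the transpose of D_k. This gives a k by n matrix, where k is the number of dimensions and n is the number of documents, so each column holds one document's coordinates in the reduced space. The sk component is stored as a vector of singular values, hence the diag() call. With myLSAspace standing for your LSA space object (lsaSpace in the question), the code would be as follows: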

lsaMatrix <- diag(myLSAspace$sk) %*% t(myLSAspace$dk)

Then it's as simple as putting the resulting matrix through the cosine function from the lsa package, which compares the columns (here, the documents) pairwise:

simMatrix <- cosine(lsaMatrix)

This results in an n by n similarity matrix, with 1 on the diagonal, which can then be used for clustering etc.
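For anyone who wants to try the idea without a corpus at hand, here is a minimal sketch of the same steps in base R. It is only an illustration under some assumptions: the term-document counts, the names tdm_toy, doc_coords and sim_df, and the choice k = 2 are all made up, and base svd() stands in for the lsa() call, so the numbers will not match a real lsaSpace.

```r
# Toy term-document matrix: rows are terms, columns are documents.
# All counts and names here are invented for illustration.
tdm_toy <- matrix(
  c(2, 0, 1, 0,
    1, 0, 2, 0,
    0, 3, 0, 1,
    0, 1, 0, 2,
    1, 1, 1, 1),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    c("price", "refund", "login", "password", "help"),
    c("doc1", "doc2", "doc3", "doc4")
  )
)

k <- 2                              # number of dimensions kept (arbitrary here)
dec <- svd(tdm_toy)                 # tdm_toy = u %*% diag(d) %*% t(v)
sk <- dec$d[1:k]                    # stands in for myLSAspace$sk
dk <- dec$v[, 1:k, drop = FALSE]    # stands in for myLSAspace$dk (documents x k)

# k x n matrix of document coordinates; nrow = k keeps diag() safe when k is 1
doc_coords <- diag(sk, nrow = k) %*% t(dk)
colnames(doc_coords) <- colnames(tdm_toy)

# Cosine similarity between columns: scale each column to unit length,
# then take all pairwise dot products
unit_cols <- sweep(doc_coords, 2, sqrt(colSums(doc_coords^2)), "/")
sim_matrix <- crossprod(unit_cols)

# Labelled data frame: rows and columns are the documents
sim_df <- as.data.frame(round(sim_matrix, 3))
print(sim_df)
```

With a real LSA space, the labels could be the texts themselves instead of doc1 to doc4, provided they are unique, since data frame row names must not repeat.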

You can read more about the S_k and D_k matrices in the lsa package documentation; they are outputs of the SVD that the package applies.

https://cran.r-project.org/web/packages/lsa/lsa.pdf
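As a side note on how this relates to the as.textmatrix() call in the question: as far as I understand it, as.textmatrix() rebuilds the rank-k matrix T_k S_k D_k^T, and because the columns of T_k are orthonormal, the document cosines computed from that rebuilt matrix should equal those computed from S_k D_k^T. The sketch below checks this numerically in base R. The matrix x, the helper cosine_cols and k = 3 are hypothetical, and svd() is again used in place of lsa().

```r
# Made-up 6 terms x 5 documents count matrix
x <- matrix(
  c(2, 0, 1, 0, 1,
    1, 0, 2, 0, 0,
    0, 3, 0, 1, 1,
    0, 1, 0, 2, 0,
    1, 1, 1, 1, 2,
    0, 2, 1, 0, 1),
  nrow = 6, byrow = TRUE
)

k <- 3
dec <- svd(x)
tk <- dec$u[, 1:k, drop = FALSE]   # terms x k, analogous to $tk
sk <- dec$d[1:k]                   # singular values, analogous to $sk
dk <- dec$v[, 1:k, drop = FALSE]   # documents x k, analogous to $dk

# Hypothetical helper: cosine similarity between the columns of m
cosine_cols <- function(m) {
  unit <- sweep(m, 2, sqrt(colSums(m^2)), "/")
  crossprod(unit)
}

# Rank-k reconstruction (terms x documents) versus k x n document coordinates
reconstructed <- tk %*% diag(sk, nrow = k) %*% t(dk)
doc_coords <- diag(sk, nrow = k) %*% t(dk)

sim_from_reconstruction <- cosine_cols(reconstructed)
sim_from_coords <- cosine_cols(doc_coords)

# Compare the two similarity matrices up to floating-point tolerance
print(all.equal(sim_from_reconstruction, sim_from_coords))
```

The practical difference is size: the coordinate matrix has only k rows, while the reconstruction has one row per term.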
