
I am trying to calculate the similarity of the rows of one document-term matrix with the rows of another document-term matrix.

library(quanteda)  # for corpus(), dfm(), textstat_simil()

A <- data.frame(name = c(
  "X-ray right leg arteries",
  "x-ray left shoulder",
  "x-ray leg arteries",
  "x-ray leg with 20km distance"
), stringsAsFactors = F)

B <- data.frame(name = c(
  "X-ray left leg arteries",
  "X-ray leg",
  "xray right leg",
  "X-ray right leg arteries"
), stringsAsFactors = F)

corp1 <- corpus(A, text_field = "name")
corp2 <- corpus(B, text_field = "name")

docnames(corp1) <- paste("A", seq_len(ndoc(corp1)), sep = ".")
docnames(corp2) <- paste("B", seq_len(ndoc(corp2)), sep = ".")

dtm3 <- rbind(dfm(corp1, ngrams=2), dfm(corp2, ngrams=2))
d1 = textstat_simil(dtm3, method = "cosine")
d1 = as.matrix(d1)

d1 = d1[grepl("^A.",row.names(d1)),grepl("^B.",colnames(d1))]

In the code above I calculate similarity on the combined matrix and then remove the irrelevant cells. Is it possible to compare one document from A at a time in textstat_simil(dtm3, method = "cosine")? Below is the table I am looking for. Also, the object size of the matrix doubles when I use as.matrix(d1).

         B.1       B.2       B.3       B.4
A.1 0.3333333 0.0000000 0.4082483 1.0000000
A.2 0.4082483 0.0000000 0.0000000 0.0000000
A.3 0.4082483 0.7071068 0.0000000 0.4082483
A.4 0.0000000 0.5000000 0.0000000 0.0000000
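For reference, the cosine values in the table come from the standard formula, the dot product of two term-count vectors divided by the product of their norms. A minimal base-R sketch (the `cosine` helper here is just for illustration, not part of quanteda):

```r
# Cosine similarity of two term-count vectors, computed by hand
cosine <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

# Two documents that share one feature, each having two features in total:
cosine(c(1, 1, 0), c(1, 0, 1))  # 0.5
```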
john

1 Answer


This will work, although, as you point out, coercing the dist-class return value of textstat_simil() into a matrix doubles its size.

d2 <- textstat_simil(dtm3, method = "cosine", diag = TRUE)
as.matrix(d2)[docnames(corp1), docnames(corp2)]
#           B.1       B.2       B.3       B.4
# A.1 0.3333333 0.0000000 0.4082483 1.0000000
# A.2 0.4082483 0.0000000 0.0000000 0.0000000
# A.3 0.4082483 0.7071068 0.0000000 0.4082483
# A.4 0.0000000 0.5000000 0.0000000 0.0000000
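If the quanteda version in use accepts a vector for the selection argument of textstat_simil() (the lapply() example below already relies on selection for a single document, so this is an assumption about vector support), the B documents can be selected in one call, and only an ndoc(dtm3) x ndoc(corp2) matrix is ever materialised:

```r
# Sketch: compute similarities against the B documents only,
# then keep the A rows; avoids the full square matrix
d3 <- textstat_simil(dtm3, selection = docnames(corp2), method = "cosine")
d3[docnames(corp1), ]
```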

Note that your use of ngrams = 2 in the creation of dtm3 creates a dfm from bigram features only (which are quite infrequent). If you want unigrams as well as bigrams, this should be ngrams = 1:2 instead.

That should work well for most problems. If you are worried about the size of your object, you can either loop across individual selections of dtm3, building up the target object, or lapply() the comparisons as follows (though this is much less efficient):

cosines <- lapply(docnames(corp2), 
                  function(x) textstat_simil(dtm3[c(x, docnames(corp1)), ],
                                             method = "cosine",
                                             selection = x)[-1, , drop = FALSE])
do.call(cbind, cosines)
#           B.1       B.2       B.3       B.4
# A.1 0.3333333 0.0000000 0.4082483 1.0000000
# A.2 0.4082483 0.0000000 0.0000000 0.0000000
# A.3 0.4082483 0.7071068 0.0000000 0.4082483
# A.4 0.0000000 0.5000000 0.0000000 0.0000000
Ken Benoit
  • Thanks Ken. It works like a charm. Just a question: do you think tf-idf with normalisation would help in document similarity? I think it helps when comparing documents within the same corpus. How about normalisation when calculating similarity? – john Feb 18 '18 at 15:23
  • tf-idf will downweight features that occur in many documents, thereby reducing similarity. The question is whether you want this. Normalization (using relative term frequencies) will not affect cosine similarity, since it is based on the angles between dimension vectors, not their length. (Try it!) – Ken Benoit Feb 19 '18 at 08:23
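The point in the last comment, that normalisation to relative term frequencies leaves cosine similarity unchanged, can be checked with a quick base-R sketch (the `cosine` helper is an illustration, not a quanteda function): scaling a vector changes its length but not its angle.

```r
cosine <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

x <- c(2, 1, 0)
y <- c(1, 0, 3)

cosine(x, y)                    # raw counts
cosine(x / sum(x), y / sum(y))  # relative frequencies: identical value
```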