
I am learning to assess text similarity between documents. Going through the text2vec tutorial (http://text2vec.org/similarity.html) on the topic, I noticed that the code returns two values for similarity. Here is the tail end of the tutorial code by Dmitriy Selivanov:

d1_d2_cos_sim = sim2(dtm1, dtm2, method = "cosine", norm = "l2")
dim(d1_d2_cos_sim)

[1] 300 200

Which returned value (300 or 200) describes text similarity/difference?

1 Answer


Neither. dim just returns the number of rows and columns of the d1_d2_cos_sim matrix: 300 by 200. The similarity scores are the values inside d1_d2_cos_sim, as you can see from the next line of code in the tutorial, d1_d2_cos_sim[1:2, 1:5], which returns the first 2 rows and first 5 columns. Each row corresponds to a document in dtm1 and each column to a document in dtm2, so that slice shows the similarity of the first 2 documents of dtm1 to the first 5 documents of dtm2.
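To make the difference between the shape and the scores concrete, here is a minimal base R sketch. It does not use text2vec; the matrices m1 and m2 and the helper l2_normalise are made up for illustration, and the cosine similarity is computed by hand, which I'd expect to match what sim2 with method = "cosine" and norm = "l2" does conceptually.

```r
# Toy document-term matrices (hypothetical counts, not the tutorial's data):
# 3 documents in m1 and 2 documents in m2, over the same 4-term vocabulary.
m1 <- matrix(c(1, 0, 2, 0,
               0, 1, 0, 1,
               1, 1, 1, 1), nrow = 3, byrow = TRUE)
m2 <- matrix(c(1, 0, 2, 0,
               0, 3, 0, 3), nrow = 2, byrow = TRUE)

# Scale every row to unit L2 length (hypothetical helper).
l2_normalise <- function(m) m / sqrt(rowSums(m^2))

# Entry [i, j] is the cosine similarity between
# document i of m1 and document j of m2.
cos_sim <- l2_normalise(m1) %*% t(l2_normalise(m2))

dim(cos_sim)    # shape only: one row per m1 document, one column per m2 document
cos_sim[1, 1]   # m1 doc 1 vs m2 doc 1: identical term counts
cos_sim[1, 2]   # m1 doc 1 vs m2 doc 2: no terms in common
cos_sim         # the full matrix of pairwise similarities
```

So dim only tells you how many pairs were compared; to get a similarity you index the matrix with the pair of documents you care about.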

phiver