I have a data frame with two text fields: the comment and the main post (post_text).
Basically, this is the structure:
id comment post_text
1 "I think that blabla.." "Why is blabla.."
2 "Well, you should blabla.." "okay, blabla.."
3 ...
I want to compute the similarity between the text in comment and the text in post_text within the same row (row one with row one, row two with row two, and so on), for all rows. As far as I know, I have to create separate dfm objects for the two types of text:
corp1 <- corpus(r, text_field = "comment")
corp2 <- corpus(r, text_field = "post_text")
dfm1 <- dfm(corp1)
dfm2 <- dfm(corp2)
In the end, I want to obtain something like this:
id comment post_text similarity
1 "I think that blabla.." "Why is blabla.." *similarity between comment1 and post_text1
2 "Well, you should blabla.." "okay, blabla.." *similarity between comment2 and post_text2
3 ...
I am not sure how to proceed. I found the StackOverflow question "Pairwise Distance between documents", but there they compute the cross-similarity between all documents in a dfm, while I need the similarity row by row.
So what I thought of doing was the following:
dtm <- rbind(dfm(corp1), dfm(corp2))
d2 <- textstat_simil(dtm, method = "cosine", diag = TRUE)
matrixsim <- as.matrix(d2)[docnames(corp1), docnames(corp2)]
diagonale <- diag(matrixsim)
But the diagonal is just a series of ones: 1 1 1 1 ...
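A likely cause, stated as an assumption: both corpora receive the same default document names (text1, text2, ...), so indexing the similarity matrix by docnames(corp1) and docnames(corp2) selects the same documents twice, and each document is compared with itself. For reference, here is a minimal sketch of the row-wise cosine similarity I am after. The toy data frame r is invented for illustration, and the sketch assumes quanteda version 3 or later, where tokens() is called before dfm(); dfm_match() is used only to give both matrices the same columns.

```r
library(quanteda)

# Toy data standing in for the real data frame (invented for illustration)
r <- data.frame(
  id = 1:2,
  comment = c("I think that cats are great", "Well, you should try again"),
  post_text = c("Why is it that cats are great", "okay, something else entirely"),
  stringsAsFactors = FALSE
)

# One dfm per text field; rows stay in the same order as the data frame
dfm1 <- dfm(tokens(corpus(r, text_field = "comment")))
dfm2 <- dfm(tokens(corpus(r, text_field = "post_text")))

# Pad both matrices to a shared feature set so that columns line up
feats <- union(featnames(dfm1), featnames(dfm2))
m1 <- as.matrix(dfm_match(dfm1, feats))
m2 <- as.matrix(dfm_match(dfm2, feats))

# Row-wise cosine similarity: dot product of row i of m1 with row i of m2,
# divided by the product of their Euclidean norms.
# A row with no tokens at all would give NaN (division by zero).
r$similarity <- rowSums(m1 * m2) / sqrt(rowSums(m1^2) * rowSums(m2^2))
```

This goes through dense matrices, which is fine for a small example but may use too much memory on a large corpus.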
Any idea how I can solve this problem? Thank you in advance for your help,
Carlo