quanteda: calculate text similarity by row between two DFMs

Question

I have a data frame with two text fields: the comment and the main post.

Basically, this is the structure:

         id  comment                        post_text
          1   "I think that blabla.."        "Why is blabla.."
          2   "Well, you should blabla.."    "okay, blabla.."
          3    ...

I want to compute the similarity between the comment text and the post_text in row one, and do the same for every row. As far as I know, I have to create separate dfm objects for the two types of text:

          corp1 <- corpus(r, text_field = "comment")
          corp2 <- corpus(r, text_field = "post_text")
          dfm1 <- dfm(corp1)
          dfm2 <- dfm(corp2)

In the end, I want to obtain something like this:

id  comment                      post_text          similarity
1   "I think that blabla.."      "Why is blabla.."  *similarity between comment1 and post_text1
2   "Well, you should blabla.."  "okay, blabla.."   *similarity between comment2 and post_text2
3    ...

I am not sure how to proceed. I found this on StackOverflow, "Pairwise Distance between documents", but they compute the cross-similarity between DFMs, while I need the similarity by row.

So basically, what I thought of doing was the following:

      dtm <- rbind(dfm(corp1), dfm(corp2))
      d2 <- textstat_simil(dtm, method = "cosine", diag = TRUE)
      matrixsim <- as.matrix(d2)[docnames(corp1), docnames(corp2)]
      diagonale <- diag(matrixsim)

But the diagonal is just a list of 1 1 1 1..

Any idea how I can solve this problem? Thank you in advance for your help,

Carlo

Well, in your example, the cosine similarity will be exactly 1.0 between "blabla" and "blabla". Where would the similarity of "3,08" come from? (that's an impossible value for cosine similarity btw). And: Are the unit ids different documents, so that you want the similarity between id1.comment and id1.post_text, with each being vectors of tokens? — Ken Benoit, Apr 11 '19 at 10:32
I wrote blabla for the sake of simplicity; I realize that may have been confusing. The texts would be different, for instance in row one I have comment="You already agree that taxes go toward services that societ.." and text_field="First of all I am not an extremist on this view, I personally fin..". The ids refer to the comment/text_field pair whose similarity I want to calculate. Basically, I want to calculate the similarity between each pair of comment and text_field — Carbo, Apr 11 '19 at 10:38

score 5 · Accepted Answer · answered Apr 11 '19 at 10:58

I'd do it by creating a single column of documents, but distinguish them using docnames indicating the type of document.

df <- data.frame(
  id = c(1, 2),
  comment = c(
    "I think that blabla..",
    "Well, you should blabla"
  ),
  post_text = c(
    "Why is blabla",
    "okay, blabla"
  ),
  stringsAsFactors = FALSE
)

# stack these into a single "document" column, plus a docvar
# identifying the document type
df <- tidyr::gather(df, "source", "text", -id)
df
##   id    source                    text
## 1  1   comment   I think that blabla..
## 2  2   comment Well, you should blabla
## 3  1 post_text           Why is blabla
## 4  2 post_text            okay, blabla

library("quanteda")
## Package version: 1.4.3
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View

corp <- corpus(df)
docnames(corp) <- paste(df$id, df$source, sep = "_")
dfm(corp) %>%
  textstat_simil()
##               1_comment   2_comment 1_post_text
## 2_comment   -0.39279220                        
## 1_post_text -0.14907120 -0.09759001            
## 2_post_text -0.14907120  0.29277002  0.11111111

You now can slice out what you want using matrix subsetting. (Use as.matrix() to turn the output from textstat_simil() into a matrix.)
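
As a minimal sketch of that slicing step, the base R snippet below rebuilds the similarity matrix from the values printed above (standing in for `as.matrix()` applied to the `textstat_simil()` result) and picks out one value per id. The names `sim_mat`, `ids`, `pair_sim`, and `result` are illustrative and not part of quanteda.

```r
# Stand-in for as.matrix(textstat_simil(dfm(corp))): the lower-triangle values
# printed above, mirrored into a full symmetric matrix with 1s on the diagonal.
docs <- c("1_comment", "2_comment", "1_post_text", "2_post_text")
sim_mat <- diag(1, 4)
dimnames(sim_mat) <- list(docs, docs)
sim_mat[lower.tri(sim_mat)] <- c(
  -0.39279220, -0.14907120, -0.14907120,  # column 1_comment
  -0.09759001,  0.29277002,               # column 2_comment
   0.11111111                             # column 1_post_text
)
sim_mat[upper.tri(sim_mat)] <- t(sim_mat)[upper.tri(sim_mat)]

# One (row name, column name) pair per id: comment i against post_text i.
ids <- c(1, 2)
pair_sim <- sim_mat[cbind(paste(ids, "comment", sep = "_"),
                          paste(ids, "post_text", sep = "_"))]

result <- data.frame(id = ids, similarity = pair_sim)
print(result)
# id 1 -> -0.1490712, id 2 -> 0.29277002
```

Note that the printed values come from the default method of `textstat_simil()`, which is correlation. To get cosine similarity as in the question, `method = "cosine"` would need to be passed explicitly.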

For those reading this after 2020, `textstat_simil` is now part of the separate package quanteda.textstats — thiagogps, Oct 08 '21 at 16:03
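
Under that newer package layout, the answer's pipeline might look roughly like the sketch below. It assumes quanteda (version 3 or later) and quanteda.textstats are both installed. The explicit `tokens()` step is there because, as far as I know, recent quanteda releases expect `dfm()` to be given a tokens object rather than a corpus. `method = "cosine"` is chosen to match the question, and object names such as `long`, `dfmat`, and `sim_mat` are illustrative.

```r
library(quanteda)
library(quanteda.textstats)  # textstat_simil() lives here in newer releases

df <- data.frame(
  id = c(1, 2),
  comment = c("I think that blabla..", "Well, you should blabla"),
  post_text = c("Why is blabla", "okay, blabla"),
  stringsAsFactors = FALSE
)

# Stack both text columns into one, naming documents "1_comment", "1_post_text", ...
long <- data.frame(
  doc_id = c(paste(df$id, "comment", sep = "_"),
             paste(df$id, "post_text", sep = "_")),
  text = c(df$comment, df$post_text),
  stringsAsFactors = FALSE
)

corp <- corpus(long)          # picks up the doc_id and text columns by default
dfmat <- dfm(tokens(corp))    # tokenize first, then build the dfm
sim_mat <- as.matrix(textstat_simil(dfmat, method = "cosine"))

# Attach the comment-vs-post similarity for each id to the original rows.
df$similarity <- sim_mat[cbind(paste(df$id, "comment", sep = "_"),
                               paste(df$id, "post_text", sep = "_"))]
df
```

This computes the full document-by-document matrix and keeps only the matching pairs, which is fine for small data but wasteful for many rows.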
