I have a data frame with two text fields: the comment and the main post.

Basically, this is the structure:

         id  comment                        post_text
          1   "I think that blabla.."        "Why is blabla.."
          2   "Well, you should blabla.."    "okay, blabla.."
          3    ...

I want to compute the similarity between the text in comment in row one and the text in post_text in row one, and do this for all the rows. As far as I know, I have to create separate dfm objects for the two types of text:

          corp1 <- corpus(r , text_field= "comment")
          corp2 <- corpus(r , text_field= "post_text")
          dfm1 <- dfm(corp1)
          dfm2 <- dfm(corp2)

In the end, I want to obtain something like this:

id  comment                     post_text          similarity
1   "I think that blabla.."     "Why is blabla.."  *similarity between comment1 and post_text1
2   "Well, you should blabla.." "okay, blabla.."  *similarity between comment2 and post_text2
3    ...

I am not sure how to proceed. I found this on StackOverflow, "Pairwise Distance between documents", but it computes the cross-similarity between dfms, while I need the similarity by row.

So basically, what I thought of doing was the following:

      dtm <- rbind(dfm(corp1), dfm(corp2))
      d2 <- textstat_simil(dtm, method = "cosine", diag = TRUE)
      matrixsim<- as.matrix(d2)[docnames(corp1), docnames(corp2)]
      diagonale <- diag(matrixsim)

But the diagonal is just a vector of 1 1 1 1..

Any idea how I can solve this problem? Thank you in advance for your help,

Carlo

David Batista
Carbo
  • Well, in your example, the cosine similarity will be exactly 1.0 between "blabla" and "blabla". Where would the similarity of "3,08" come from? (that's an impossible value for cosine similarity btw). And: Are the unit ids different documents, so that you want the similarity between id1.comment and id1.post_text, with each being vectors of tokens? – Ken Benoit Apr 11 '19 at 10:32
  • I wrote blabla for the sake of simplicity; I realize that this may be confusing. The texts would be different: for instance, in row one I have comment="You already agree that taxes go toward services that societ.." and text_field="First of all I am not an extremist on this view, I personally fin..". The ids refer to the comment/text_field pairs whose similarity I want to calculate. Basically, I want to calculate the similarity between each pair of comment and text_field – Carbo Apr 11 '19 at 10:38
  • I edited the question to make it clearer @KenBenoit – Carbo Apr 11 '19 at 10:51

1 Answer

I'd do it by creating a single column of documents, but distinguishing them using docnames that indicate the type of document.

df <- data.frame(
  id = c(1, 2),
  comment = c(
    "I think that blabla..",
    "Well, you should blabla"
  ),
  post_text = c(
    "Why is blabla",
    "okay, blabla"
  ),
  stringsAsFactors = FALSE
)

# stack these into a single "document" column, plus a docvar
# identifying the document type
df <- tidyr::gather(df, "source", "text", -id)
df
##   id    source                    text
## 1  1   comment   I think that blabla..
## 2  2   comment Well, you should blabla
## 3  1 post_text           Why is blabla
## 4  2 post_text            okay, blabla

library("quanteda")
## Package version: 1.4.3
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View

corp <- corpus(df)
docnames(corp) <- paste(df$id, df$source, sep = "_")
dfm(corp) %>%
  textstat_simil()
##               1_comment   2_comment 1_post_text
## 2_comment   -0.39279220                        
## 1_post_text -0.14907120 -0.09759001            
## 2_post_text -0.14907120  0.29277002  0.11111111

You can now slice out what you want using matrix subsetting. (Use as.matrix() to turn the output from textstat_simil() into a matrix.)
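As a rough sketch of that slicing step, the base R snippet below rebuilds the similarity matrix by hand from the values printed above, then picks out one comment/post_text pair per id. The names `docs`, `sim`, `pairs` and `result` are made up for illustration; in practice `sim` would come from `as.matrix()` applied to the `textstat_simil()` output. Note that the printed values come from the default method, which is "correlation"; passing `method = "cosine"` to `textstat_simil()` should give cosine similarity instead.

```r
# Document names as built with paste(df$id, df$source, sep = "_")
docs <- c("1_comment", "2_comment", "1_post_text", "2_post_text")

# Rebuild the symmetric matrix from the printed lower triangle
# (stand-in for as.matrix(textstat_simil(...)))
sim <- diag(1, 4)
dimnames(sim) <- list(docs, docs)
sim[lower.tri(sim)] <- c(-0.39279220, -0.14907120, -0.14907120,
                         -0.09759001,  0.29277002,  0.11111111)
sim[upper.tri(sim)] <- t(sim)[upper.tri(sim)]

# One (row name, column name) pair per id: comment vs. post_text
ids <- c(1, 2)
pairs <- cbind(paste0(ids, "_comment"), paste0(ids, "_post_text"))

# Indexing with a two-column character matrix returns one cell per row
result <- data.frame(id = ids, similarity = sim[pairs])
result
```

The `similarity` column can then be merged back onto the original wide data frame by `id`.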

Ken Benoit
  • For those reading this after 2020, `textstat_simil` is now part of the separate package quanteda.textstats – thiagogps Oct 08 '21 at 16:03