How can I calculate cosine similarity between two sets of individual documents, using quanteda?

Question

I have two sets of documents: One with approx. 580 news articles and one with approx. 560 political decisions. I want to find out whether there are similarities between the individual news articles and the political decisions. This means that each individual news article should be compared with each of the 560 political decisions, using cosine similarity. I am using the quanteda package.

This is what I have tried so far:

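library(readtext)            # readtext()
library(quanteda)            # corpus(), tokens(), dfm()
library(quanteda.textstats)  # textstat_simil(), assuming quanteda >= 3.0
library(writexl)             # write_xlsx()

news_articles <- readtext(paste0(txt_directory, "*"), encoding = "UTF-8")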
news_articles_corpus <- corpus(news_articles)

pol_decisions <- readtext(paste0(txt_directory, "*"), encoding = "UTF-8")
pol_decisions_corpus <- corpus(pol_decisions)

news_articles_toks <- tokens(
  news_articles_corpus,
  what = "word",
  remove_punct = TRUE,
  remove_symbols = TRUE,
  remove_numbers = TRUE,
  remove_url = TRUE,
  remove_separators = TRUE,
  verbose = TRUE)

news_articles_toks <- tokens_tolower(news_articles_toks, keep_acronyms = FALSE)
news_articles_toks <- tokens_select(news_articles_toks, stopwords("danish"), selection = "remove")
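# The stemmer defaults to English; the stopwords above are Danish, so stem in Danish too
news_articles_toks <- tokens_wordstem(news_articles_toks, language = "danish")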

pol_decisions_toks <- tokens(
  pol_decisions_corpus,
  what = "word",
  remove_punct = TRUE,
  remove_symbols = TRUE,
  remove_numbers = TRUE,
  remove_url = TRUE,
  remove_separators = TRUE,
  verbose = TRUE)

pol_decisions_toks <- tokens_tolower(pol_decisions_toks, keep_acronyms = FALSE)
pol_decisions_toks <- tokens_select(pol_decisions_toks, stopwords("danish"), selection = "remove")
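# The stemmer defaults to English; the stopwords above are Danish, so stem in Danish too
pol_decisions_toks <- tokens_wordstem(pol_decisions_toks, language = "danish")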

news_articles_dfm <- dfm(news_articles_toks)
pol_decisions_dfm <- dfm(pol_decisions_toks)

cosine <- textstat_simil(
  news_articles_dfm,
  y = pol_decisions_dfm,
  selection = NULL,
  margin = c("documents"),
  method = c("cosine"))

cosine <- as.data.frame(cosine)
cosine <- cosine[order(-cosine$cosine),]
write_xlsx(cosine, "Test.xlsx")

My problem is that when I run the textstat_simil function, R returns cosine values for all combinations - both within and between the two sets of documents. But I don't want to know the cosine similarity between two news articles or between two political decisions. I only want to know the cosine similarity between a news article and a political decision.

Is there any way to solve this issue?

score 2 · Answer 1 · answered Jun 25 '22 at 06:55

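Pass one dfm as x and the other as y in textstat_simil(), with no other arguments apart from method. The result then contains only the similarities between documents in x and documents in y, not the similarities within either set.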

require(quanteda)
#> Loading required package: quanteda
#> Package version: 3.2.1
#> Unicode version: 13.0
#> ICU version: 69.1
#> Parallel computing: 4 of 4 threads used.
#> See https://quanteda.io for tutorials and examples.
require(quanteda.textstats)
#> Loading required package: quanteda.textstats

corp_news <- corpus(c(news1 = "politics party vote", 
                      news2 = "crime police family"))
corp_pol <- corpus(c(pol1 = "member party vote", 
                     pol2 = "family income", 
                     pol3 = "crime prison"))

dfmt_news <- tokens(corp_news) %>% dfm()
dfmt_pol <- tokens(corp_pol) %>% dfm()

dfmt_news
#> Document-feature matrix of: 2 documents, 6 features (50.00% sparse) and 0 docvars.
#>        features
#> docs    politics party vote crime police family
#>   news1        1     1    1     0      0      0
#>   news2        0     0    0     1      1      1
dfmt_pol
#> Document-feature matrix of: 3 documents, 7 features (66.67% sparse) and 0 docvars.
#>       features
#> docs   member party vote family income crime prison
#>   pol1      1     1    1      0      0     0      0
#>   pol2      0     0    0      1      1     0      0
#>   pol3      0     0    0      0      0     1      1

textstat_simil(x = dfmt_news, y = dfmt_pol, method = "cosine")
#> textstat_simil object; method = "cosine"
#>        pol1  pol2  pol3
#> news1 0.667     0     0
#> news2     0 0.408 0.408

Created on 2022-06-25 by the reprex package (v2.0.1)
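
To see where the numbers in that output come from, here is a minimal base R sketch that recomputes the same cross-set cosine values by hand and then puts them in a long table sorted by similarity, which is roughly what the question does before writing to Excel. It assumes the two count matrices are aligned on one shared vocabulary, and `cross_cosine()` is a hypothetical helper written for this illustration, not a quanteda function.

```r
# Term counts over a shared vocabulary, copied from the two dfms above.
vocab <- c("politics", "party", "vote", "crime", "police", "family",
           "member", "income", "prison")

news <- rbind(
  news1 = c(1, 1, 1, 0, 0, 0, 0, 0, 0),
  news2 = c(0, 0, 0, 1, 1, 1, 0, 0, 0)
)
pol <- rbind(
  pol1 = c(0, 1, 1, 0, 0, 0, 1, 0, 0),
  pol2 = c(0, 0, 0, 0, 0, 1, 0, 1, 0),
  pol3 = c(0, 0, 0, 1, 0, 0, 0, 0, 1)
)
colnames(news) <- vocab
colnames(pol) <- vocab

# Hypothetical helper (not part of quanteda): cosine similarity between
# every row of x and every row of y, and nothing within x or within y.
cross_cosine <- function(x, y) {
  dot <- x %*% t(y)                                   # nrow(x) by nrow(y) dot products
  norms <- sqrt(rowSums(x^2)) %o% sqrt(rowSums(y^2))  # products of vector lengths
  dot / norms
}

sim <- cross_cosine(news, pol)
round(sim, 3)
# news1/pol1 is 2/3; news2/pol2 and news2/pol3 are both 1/sqrt(6)

# Long format, one row per news/decision pair, highest similarity first.
sim_long <- as.data.frame(as.table(sim), responseName = "cosine")
names(sim_long)[1:2] <- c("document1", "document2")
sim_long <- sim_long[order(-sim_long$cosine), ]
sim_long
```

The matrix has one row per news article and one column per political decision, so no news-to-news or decision-to-decision pairs appear in it. With the real data the same shape would be about 580 by 560.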
