
For each document, I need to find the highest similarity score between that document and all documents created before it.

I plan to use the quanteda package in R and came up with the following code. dfm is the document-feature matrix, which has more than 3 million documents and 4 million features. In each iteration, I compare the target document dfm[id_i, ] with all the documents prior to it, dfm_subset(dfm, date < date_i). The resulting vector of similarity scores is stored in one_simil, and the highest score is obtained with max(one_simil, na.rm = TRUE).

Normally, dfm_subset(dfm, date < date_i) contains more than 1 million documents, so computing one_simil is quite expensive, taking around 1 minute. Since I need the highest similarity for around 1 million documents, the total computation time is far too long (about 2,000,000 minutes).

Is there any way to speed up the calculation? My thought is that since I am only interested in the highest similarity score, I do not need to compare dfm[id_i, ] with every document in dfm_subset(dfm, date < date_i), so there should be room for improvement, but I don't know how. Any suggestion is welcome!

library("quanteda")
library("quanteda.textstats")

similarity_res <- vector("list", nrow(to_find_docs))  # store the result

for (row_i in seq_len(nrow(to_find_docs))) {
  id_i <- to_find_docs$id[row_i]
  date_i <- to_find_docs$date[row_i]

  # cosine similarity between the target document and all earlier documents
  one_simil <- textstat_simil(
    dfm_subset(dfm, date < date_i),  # the comparison documents
    dfm[id_i, ],                     # the document to find similarity for
    margin = "documents", method = "cosine"
  )

  similarity_res[[row_i]] <- data.frame(
    id = id_i,
    highest_similarity = max(one_simil, na.rm = TRUE)
  )
}
Ding Li
  • Maybe some useful things [here](https://stackoverflow.com/questions/48058104/efficient-string-similarity-grouping/48096986#comment83387786_48096986)? – MrSmithGoesToWashington Sep 12 '22 at 08:15
  • See https://github.com/koheiw/proxyC. This package is the workhorse behind `textstat_simil()`. It has additional functions like `min_simil` and `rank` to make computation faster. – Kohei Watanabe Sep 12 '22 at 09:13
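
Below is a minimal sketch of what the second comment suggests: calling proxyC::simil() directly on the sparse dfm objects, with min_simil so that low-similarity pairs are never stored. The variables dfm, id_i, and date_i are taken from the question's loop; the threshold value here is purely illustrative.

library("quanteda")
library("proxyC")

prior <- dfm_subset(dfm, date < date_i)   # documents created before the target

one_simil <- proxyC::simil(
  as(prior, "dgCMatrix"),                 # comparison documents as a sparse matrix
  as(dfm[id_i, ], "dgCMatrix"),           # the target document
  margin = 1, method = "cosine",
  min_simil = 0.9                         # pairs below this value are not stored
)

max(one_simil)                            # 0 if no prior document reaches min_simil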

1 Answer


As @Kohei states in the comment, you can use the insanely efficient underlying proxyC library to make this fast. While you do not specify a target output in your question, the following should guide you. Use the min_simil argument to set a very high minimum similarity, and then find the most similar pair within the resulting sparse matrix (entries are simply not computed for any pair of documents whose similarity is below the minimum).

Here is an example with the 200+ document State of the Union corpus:

library("quanteda")
#> Package version: 3.2.3
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
library("quanteda.textstats")
library("Matrix")

data("data_corpus_sotu", package = "quanteda.corpora")

dfmat <- data_corpus_sotu |>
    corpus_subset(Date > "1821-01-01") |>
    tokens() |>
    dfm()

similmat <- textstat_simil(dfmat, method = "cosine", min_simil = .99)
diag(similmat) <- NA

idx <- which(similmat == max(similmat, na.rm = TRUE), arr.ind = TRUE)
rownames(idx)
#> [1] "McKinley-1899" "McKinley-1900"
similmat[rownames(idx)[1], rownames(idx)[2]]
#> [1] 0.9969676

Created on 2022-09-12 with reprex v2.0.2
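
To get back to the question's requirement of comparing each document only with earlier ones, here is a rough sketch of how the pre-computed sparse matrix could be sliced by date. It reuses dfmat and similmat from the example above; the variable names below are illustrative, the result is NA for a document with no earlier documents, and 0 when no earlier document reaches min_simil.

dates <- docvars(dfmat, "Date")

highest_prior <- vapply(seq_len(ndoc(dfmat)), function(i) {
    prior <- which(dates < dates[i])          # indices of earlier documents
    if (length(prior) == 0) return(NA_real_)  # nothing published before this one
    # entries below min_simil were never computed, so they appear as 0 here
    max(similmat[prior, i], na.rm = TRUE)
}, numeric(1))
names(highest_prior) <- docnames(dfmat)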

Ken Benoit