I need to find the highest similarity score between each document and all documents written before it. I plan to use the quanteda package in R and have come up with the code below. dfm is the document-feature matrix, which has more than 3 million documents and 4 million features. In each iteration, I compare the target document dfm[id_i,] with all documents dated before it, dfm_subset(dfm, date < date_i). The resulting similarity score vector is stored in one_simil, and the highest similarity score is obtained with max(one_simil, na.rm = TRUE).
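For concreteness, here is a tiny reproducible version of the computation on toy data (the texts, dates, and object names here are made up purely for illustration; textstat_simil() lives in quanteda.textstats in recent quanteda versions):

library(quanteda)
library(quanteda.textstats)

toy_texts <- c(d1 = "apples and oranges", d2 = "apples and pears", d3 = "trains and planes")
toy_dfm   <- dfm(tokens(toy_texts))
docvars(toy_dfm, "date") <- as.Date(c("2020-01-01", "2020-02-01", "2020-03-01"))

# highest similarity of d3 with all documents dated before it
sim <- textstat_simil(
  dfm_subset(toy_dfm, date < as.Date("2020-03-01")),  # the two earlier documents
  toy_dfm["d3", ],                                     # the target document
  margin = "documents", method = "cosine"
)
max(as.matrix(sim), na.rm = TRUE)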
Normally, dfm_subset(dfm, date < date_i) contains more than 1 million documents, so computing one_simil is quite expensive, taking around 1 minute per target document. Since I need the highest similarity for around 1 million documents, the total computation time is far too long (about 2,000,000 minutes, i.e. close to four years).
Is there any way to speed up the calculation? My thinking is that since I only need the highest similarity score, I should not have to compare dfm[id_i,] with every document in dfm_subset(dfm, date < date_i), so there ought to be room for improvement, but I don't know how to exploit this. Any suggestion is welcome! My current code is below; after it, I also sketch the same computation in plain sparse-matrix terms in case that is a useful starting point.
library(quanteda)
library(quanteda.textstats)  # textstat_simil() lives here in quanteda >= 3

similarity_res <- vector("list", nrow(to_find_docs))  # one result per target document

for (row_i in seq_len(nrow(to_find_docs))) {
  id_i   <- to_find_docs$id[row_i]
  date_i <- to_find_docs$date[row_i]
  one_simil <- textstat_simil(
    dfm_subset(dfm, date < date_i),  # the comparison documents (all earlier-dated documents)
    dfm[id_i, ],                     # the target document
    margin = "documents", method = "cosine"
  )
  similarity_res[[row_i]] <- data.frame(
    id = id_i,
    highest_similarity = max(one_simil, na.rm = TRUE)
  )
}
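To make the structure of the problem explicit (this is only a direction I have been thinking about, not something I have benchmarked on the full data), cosine similarity is a dot product of L2-normalised rows, so the per-document step can be written as a single sparse matrix product with the Matrix package. All object names below are illustrative:

library(quanteda)
library(Matrix)

# a dfm is stored as a sparse matrix, so the rows can be L2-normalised once up front
m     <- as(dfm, "CsparseMatrix")
dates <- docvars(dfm, "date")
norms <- sqrt(rowSums(m * m))
norms[norms == 0] <- 1                       # guard against empty documents
m_norm <- Diagonal(x = 1 / norms) %*% m      # L2-normalised rows, still sparse

best <- rep(NA_real_, nrow(m_norm))
for (i in seq_len(nrow(m_norm))) {
  prior <- which(dates < dates[i])           # strictly earlier documents
  if (length(prior) == 0) next
  sims    <- tcrossprod(m_norm[prior, , drop = FALSE], m_norm[i, , drop = FALSE])
  best[i] <- max(sims)                       # cosine similarities; keep only the maximum
}

As written this still compares each target with every earlier document, so by itself it is no faster than my loop above; I include it only to make clear what is being computed, in the hope that someone can see how to avoid the full comparison given that only the maximum is needed.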