I have a large collection of documents, dc (several million rows), with the following data.frame structure:
doc_id body
1 'sdfadfs...'
2 'dfadf...'
3 'sadf....'
I also have about 10,000 terms (or substrings) stored in terms (terms=c('sfa','adfa','dfad',...)).
I want to find which terms occur in each document of dc. In the end, I want the result to look something like this:
doc_id term
1 'sfa'
1 'dfad'
2 'adfa'
3 'sfa'
3 'dfad'
Currently, I'm using the following code (with the help of stri_detect_fixed):
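library(stringi)  # stri_detect_fixed
library(dplyr)    # bind_rows
res_all = lapply(terms, function(term_i) {  # loop over each term
  res = stri_detect_fixed(dc$body, term_i)  # check occurrence of one term in each document
  hits = which(res)                         # row indices of the matching documents
  # rep() keeps both columns the same length, even when a term matches no document
  data.frame(doc_id = dc$doc_id[hits], term = rep(term_i, length(hits)))
})
bind_rows(res_all)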
However, the above code is quite slow. Is there anything I can do to speed up the code?