
I have a large collection of documents, dc (several million rows), stored in a data.frame with the following structure

doc_id    body
  1       'sdfadfs...'
  2       'dfadf...'
  3       'sadf....'

I also have about 10,000 terms (or substrings) stored in terms (terms=c('sfa','adfa','dfad',...)).

I want to find the occurrences of each term in each document of dc. In the end, I want the result to look something like this

doc_id term
  1    'sfa'
  1    'dfad'
  2    'adfa'
  3    'sfa'
  3    'dfad'

Currently, I'm using the following code (with the help of stri_detect_fixed from stringi):

library(stringi)
library(dplyr)

res_all <- lapply(terms, function(term_i) {      # loop over each term
  res <- stri_detect_fixed(dc$body, term_i)      # check occurrence of one term in each document
  data.frame(doc_id = which(res), term = term_i)
})

bind_rows(res_all)

However, the above code is quite slow. Is there anything I can do to speed up the code?

Ding Li
  • For a given document, do you just want to know _if_ a term occurs? Or do you want to know _how many times_ a term occurs? – Mikael Jagan Feb 14 '22 at 00:41
  • @MikaelJagan just 'if' – Ding Li Feb 14 '22 at 00:44
  • Have you tried parallelizing the loop (e.g., replacing `lapply` with `parallel::mclapply`)? Or thinking about optimizations based on nesting of terms? If a string doesn't have `"a"` as a substring, then it can't have `"ab"` as a substring, etc. – Mikael Jagan Feb 14 '22 at 00:53
  • FWIW, creating a list of several thousand data frames is quite inefficient. It would be better to have your function return just `which(res)`. Then you could construct your data frame efficiently as `data.frame(doc_id = unlist(res_all, FALSE, FALSE), term = rep.int(gl(length(terms), 1L, labels = terms), lengths(res_all, FALSE)))` (see the sketch after these comments). – Mikael Jagan Feb 14 '22 at 01:14
  • Looks like the bottleneck for this task could be in the data I/O of reading millions of rows for many documents rather than the matching itself. Perhaps worth considering shooting the `awk` folks a question? – Donald Seinen Feb 14 '22 at 02:14
  • Without a proper MRE, it is hard to know which method is fastest. Given that your data is supposed to be quite substantial, it might be that the bottleneck is not the processing time itself but the I/O, as @DonaldSeinen suggests. Please provide an MRE that captures your problem; only then can we try to optimize. – Colombo Feb 14 '22 at 02:30
  • @MikaelJagan Thanks! I tried parallelizing; however, because I have quite a few document collections, memory usage is a problem. Also, all terms are unique, so nesting does not help much. – Ding Li Feb 14 '22 at 08:46
  • @DonaldSeinen Thanks. I'd never heard of `awk`; will check it out. – Ding Li Feb 14 '22 at 08:46
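
Putting the comment suggestions together, here is a minimal sketch (assuming dc and terms as in the question; rep.int with a vector of counts is a simplification of the construction Mikael Jagan suggests):

library(stringi)

# Return only the matching row indices from each iteration instead of a data frame;
# parallel::mclapply could replace lapply if memory allows.
res_all <- lapply(terms, function(term_i) which(stri_detect_fixed(dc$body, term_i)))

# Build the result once at the end: repeat each term by its number of matching documents.
res <- data.frame(
  doc_id = unlist(res_all, use.names = FALSE),
  term   = rep.int(terms, lengths(res_all))
)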

1 Answer


Collapse the term values into one pattern and use it with stringr::str_extract_all, which will return a list. Use unnest to get each term as a separate row.

library(dplyr)

terms <- c('sfa', 'adfa', 'dfad', ...)

# combine all terms into one alternation pattern, wrapped in word boundaries
pat <- sprintf('\\b(%s)\\b', paste0(terms, collapse = '|'))

dc %>%
  mutate(term = stringr::str_extract_all(body, pat)) %>%  # list-column of matched terms per document
  tidyr::unnest(term)                                     # one row per doc_id/term match

Word boundaries (\\b) are added to the pattern so that 'sfa' does not match terms like 'asfa' or 'sfad'.
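
As a quick illustration with a made-up string (hypothetical, not from the question's data), only the standalone 'sfa' is extracted once the boundaries are in place:

library(stringr)

# 'sfa' embedded in 'asfa' and 'sfad' is not matched; only the standalone token is
str_extract_all("asfa sfad sfa", "\\bsfa\\b")
#> [[1]]
#> [1] "sfa"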

Ronak Shah