I have a fairly large set of strings in R:
set.seed(42)
#Draw a random length (2 to 20 characters) for each of 250,000 strings
strings <- sapply(1:250000, function(x) sample(2:20, 1, prob=c(
0.001, 0.006, 0.021, 0.043, 0.075, 0.101, 0.127,
0.138, 0.132, 0.111, 0.087, 0.064, 0.042, 0.025, 0.014, 0.008,
0.004, 0.002, 0.001)))
#Fill each string with random lowercase letters
strings <- lapply(strings, function(x) sample(letters, x, replace=TRUE))
strings <- sapply(strings, paste, collapse='')
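A quick sanity check that the data looks as intended (the exact counts follow the weights above):
length(strings)        #250000
range(nchar(strings))  #string lengths should span 2 to 20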
I would like to make a list denoting, for each substring in a list of substrings, which of these strings contain it. My starting point, of course, is some code from Stack Overflow:
#0.1 seconds
substrings <- sample(strings, 10)
system.time(matches <- lapply(substrings, grepl, strings, fixed=TRUE))
However, this approach is somewhat naive for larger sets of substrings, as it stores a full logical vector per substring, covering all of the matches and all of the non-matches:
#13 seconds
substrings <- sample(strings, 1000)
system.time(matches <- lapply(substrings, grepl, strings, fixed=TRUE))
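To put a number on that: each logical vector has 250,000 entries at 4 bytes each (about 1 MB), so 1,000 substrings yield roughly a gigabyte of output. You can confirm with object.size:
#Expect on the order of 1 GB for 1000 substrings
print(object.size(matches), units='auto')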
We can reduce the size of the output object by storing only the indices of the matching strings:
#13 seconds
substrings <- sample(strings, 1000)
system.time(matches <- lapply(substrings, function(x) which(grepl(x, strings, fixed=TRUE))))
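As an aside, which(grepl(...)) is equivalent to a single grep call, which returns the matching indices directly; it won't change the timing much, since the pattern matching itself dominates, but it reads more cleanly:
#Same output, one less step per substring
system.time(matches <- lapply(substrings, grep, strings, fixed=TRUE))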
But this is still slow for large numbers of substrings:
#316 seconds
substrings <- sample(strings, 25000)
system.time(matches <- lapply(substrings, function(x) which(grepl(x, strings, fixed=TRUE))))
It's nice that the time grows linearly in the number of substrings, but I feel like there has to be a much faster way to accomplish this task, perhaps by avoiding the lapply loop.
How can I speed up this many-to-many string matching function?
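One drop-in variant worth trying (an assumption on my part that it helps, not a benchmark) is the stringi package's fixed-pattern matcher, though the loop over substrings remains:
require('stringi')
#stri_detect_fixed does plain fixed-string matching, like grepl(fixed=TRUE)
system.time(matches <- lapply(substrings, function(x) which(stri_detect_fixed(strings, x))))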
/edit: One easy speedup is parallelization:
#Takes about 99 seconds
require('doParallel')
cl <- makeForkCluster(nnodes=8)  #fork cluster: Unix-alikes only
registerDoParallel(cl)
system.time(matches <- foreach(i=seq_along(substrings)) %dopar% {
which(grepl(substrings[i], strings, fixed=TRUE))
})
stopCluster(cl)
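For anyone on Windows, where fork clusters aren't available, a PSOCK cluster is the portable substitute (at the cost of copying strings to every worker):
#Portable alternative to makeForkCluster
cl <- makeCluster(8)
registerDoParallel(cl)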
However, I think most solutions to this problem will be easy to parallelize, once a fast serial algorithm has been found.