
I have a fairly large set of strings in R:

set.seed(42)
strings <- sapply(1:250000, function(x) sample(2:20, 1, prob=c(
  0.001, 0.006, 0.021, 0.043, 0.075, 0.101, 0.127, 
  0.138, 0.132, 0.111, 0.087, 0.064, 0.042, 0.025, 0.014, 0.008, 
  0.004, 0.002, 0.001)))
strings <- lapply(strings, function(x) sample(letters, x, replace=TRUE))
strings <- sapply(strings, paste, collapse='')
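A quick look at the result (250,000 random strings of 2-20 lowercase letters, with lengths concentrated around 9-10 characters):

length(strings)          #250000
summary(nchar(strings))  #lengths follow the probability vector above
head(strings)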

I would like to make a list denoting the presence or absence of each element from a list of substrings within these strings. My starting point, of course, is some code from Stack Overflow:

#0.1 seconds
substrings <- sample(strings, 10)
system.time(matches <- lapply(substrings, grepl, strings, fixed=TRUE)) 

However, this approach is somewhat naive for larger sets of substrings, as it stores all of the matches and all of the non-matches:

#13 seconds
substrings <- sample(strings, 1000)
system.time(matches <- lapply(substrings, grepl, strings, fixed=TRUE)) 

We can reduce the size of the output object by only storing the matches:

#13 seconds
substrings <- sample(strings, 1000)
system.time(matches <- lapply(substrings, function(x) which(grepl(x, strings, fixed=TRUE))))

But this is still slow for large numbers of substrings:

#316 seconds
substrings <- sample(strings, 25000)
system.time(matches <- lapply(substrings, function(x) which(grepl(x, strings, fixed=TRUE))))

It's nice that the time grows roughly linearly with the number of substrings, but I feel like there has to be a much faster way to accomplish this task, perhaps by avoiding the lapply loop.

How can I speed up this many-to-many string matching function?
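
(A possible drop-in variant, assuming the stringi package is installed: stri_detect_fixed does the same fixed-pattern matching as grepl(..., fixed=TRUE) and is often faster, but it is still the same one-substring-at-a-time loop, so I'd expect at best a constant-factor improvement.)

#Same loop as above, with grepl swapped for stri_detect_fixed
require('stringi')
system.time(matches <- lapply(substrings, function(x)
  which(stri_detect_fixed(strings, x))))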

/edit: One easy speedup is parallelization:

#Takes about 99 seconds
require('doParallel')
cl <- makeForkCluster(nnodes=8)  #fork clusters are only available on *nix systems
registerDoParallel(cl)
system.time(matches <- foreach(i=1:length(substrings)) %dopar% {
  which(grepl(substrings[i], strings, fixed=TRUE))
})
stopCluster(cl)

However, I think most solutions to this problem will be easy to parallelize, once a fast serial algorithm has been found.
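
(For anyone on Windows, where fork clusters aren't available, an untested sketch of the same loop on a PSOCK cluster from the parallel package would look something like this:)

#PSOCK clusters work on Windows as well as *nix
require('parallel')
cl <- makeCluster(8)
clusterExport(cl, 'strings')  #copy the data to each worker
system.time(matches <- parLapply(cl, substrings, function(x)
  which(grepl(x, strings, fixed=TRUE))))
stopCluster(cl)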

  • `%in%`? Is that kinda what you're looking for? – James Tobin Mar 18 '14 at 20:30
  • @James Tobin sort of. Take a look at the output of my functions. What I want to do is somewhat like `%in%`, except I want to do partial matches too. For example, the substring `marble` should return `TRUE` for each string in `c("marble", "emmarble", "marbleization", "marblehearted")`. – Zach Mar 18 '14 at 20:36
  • Gotcha... I'm on a Windows machine, so I couldn't get your example going, but now I'm interested, so I will pursue as well – James Tobin Mar 18 '14 at 20:56
  • @James Tobin Does your machine have a dictionary file somewhere? If not, you can try this wordlist, which is about 1/4 of the size of my dictionary: `strings <- unique(tolower(scan("http://scrapmaker.com/data/wordlists/twelve-dicts/5desk.txt", what="")))` – Zach Mar 18 '14 at 21:02
  • I suspect a hash table would be of use but your question is in no way minimal or reproducible. Where's the data? – Tyler Rinker Mar 18 '14 at 21:21
  • @Tyler Rinker most *nix systems have a dictionary under `/usr/share/dict/words`. If you would like, I can write some R code to generate gibberish wordlists to make the code reproducible on other systems. This code may take a lot longer to execute than loading a dictionary from a file. – Zach Mar 18 '14 at 21:25
  • @TylerRinker See my most recent edit for an example that is reproducible on non *nix systems. – Zach Mar 18 '14 at 21:36
  • @Zach you just made the edit with parallel processing =) I use snowfall, but I expect the time results were similar, mine took 50% less time with parallel on the 25000 set – James Tobin Mar 18 '14 at 21:40
  • This problem should be well researched - especially in the field of bioinformatics. Of course you would have to apply a specialized algorithm, e.g. using keyword trees, to get around the O(n²). http://tinyurl.com/prdmg4h – Raffael Mar 18 '14 at 22:06
  • @Яaffael Wonderful! Could you help me write (or find) some R code to solve it? – Zach Mar 18 '14 at 22:09
