
I am trying to gregexpr search for the locations of "ABCD" in a large string and "ABBD, ACCD, AAAD" in the same string. I want to output the "ABCD" search results and the "ABBD, ACCD, AAAD" search results in two separate columns of a data table.

My current approach is to use gregexpr separately, export each as a 1 column txt file, import each as matrices, sort each 1 column matrix so that the numbers ascend by row, column bind the two matrices, and convert the resulting two column matrix into a data table.

This approach seems highly inefficient when dealing with an extremely large string, and takes quite a while to complete. Is there any way to optimize the procedure? Thanks for your help!

# dummy string that is relatively short for this demo
x <- "ABCDACCDABBDABCDAAADACCDABBDABCD"

# SEARCH for 'ABCD' location
out1 <- gregexpr(pattern = "ABCD", x)
cat(paste(out1[[1]], collapse = "\n"), file = "~/out_1.txt")

# SEARCH for 'ABBD', 'ACCD', and 'AAAD' locations
outB <- gregexpr(pattern = "ABBD", x)
outC <- gregexpr(pattern = "ACCD", x)
outA <- gregexpr(pattern = "AAAD", x)
cat(paste(c(outA[[1]], outB[[1]], outC[[1]]), collapse = "\n"), file = "~/out_2.txt")

# Function that BINDS Matrices by column
cbind.fill <- function(...){
  nm <- list(...)
  nm <- lapply(nm, as.matrix)
  n <- max(sapply(nm, nrow))
  do.call(cbind, lapply(nm, function (x) rbind(x, matrix(, n-nrow(x), ncol(x)))))
}

# Load as Tables --> Sort by numbers increasing --> Matrices
mat1 <- as.matrix(read.table("~/out_1.txt"))
mat2.t <- read.table("~/out_2.txt")
mat2 <- as.matrix(mat2.t[order(mat2.t$V1),])

# Combine two matrices to create 2 column matrix 
comb_mat <- cbind.fill(mat1, mat2)
write.table(comb_mat, file = "~/comb_mat.txt", row.names = FALSE, col.names = FALSE)
alki

3 Answers


If you are doing a lot of regular expression matching, including on very long strings, you will want to consider the options used. Generally PCRE will be faster than the default regular expression engine, and fixed = TRUE faster still (especially when each pattern is matched only a few times).
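As a rough illustration (the repeated string below is just a synthetic stand-in for a large input, and absolute timings depend on your machine):

```r
# Compare regex engines; fixed = TRUE skips regex machinery entirely.
# 1,000 copies (32 KB) keeps this quick; scale the count up to see the gap grow.
x_big <- paste(rep("ABCDACCDABBDABCDAAADACCDABBDABCD", 1000), collapse = "")
system.time(m_tre  <- gregexpr("ABCD", x_big))                # default (TRE) engine
system.time(m_pcre <- gregexpr("ABCD", x_big, perl = TRUE))   # PCRE engine
system.time(m_fix  <- gregexpr("ABCD", x_big, fixed = TRUE))  # literal string search
```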

  • You can use sort() to sort the second column immediately, rather than storing an intermediate variable and then indexing it with order().
  • Your cbind.fill() function works, but the task of NA-padding can be accomplished easily with out-of-bounds indexing, for which R naturally returns NAs for the out-of-bounds indexes.
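A quick illustration of that out-of-bounds behaviour:

```r
# Reading past the end of a vector returns NA for the missing positions
v <- c(1, 13, 29)
v[1:5]
## [1]  1 13 29 NA NA
```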

Hence:

x <- 'ABCDACCDABBDABCDAAADACCDABBDABCD';
out1 <- c(gregexpr('ABCD',x,fixed=T)[[1]]);
out2 <- sort(c(gregexpr('AAAD',x,fixed=T)[[1]],
               gregexpr('ABBD',x,fixed=T)[[1]],
               gregexpr('ACCD',x,fixed=T)[[1]]));
outmax <- max(length(out1),length(out2));
comb_mat <- cbind(out1[1:outmax],out2[1:outmax]);
comb_mat;
##      [,1] [,2]
## [1,]    1    5
## [2,]   13    9
## [3,]   29   17
## [4,]   NA   21
## [5,]   NA   25

You can then write comb_mat to a file as per your write.table() call.


Edit: As you (and now I) have discovered, gregexpr() performs surprisingly poorly on large strings, and your 237MB string is definitely a large string. From Fast partial string matching in R, we can use the stringi package to speed up performance. What follows is a demo of how to use stringi::stri_locate_all() to accomplish your requirement. Some notes:

  • For my own testing I created my own 237MB file, which is actually exactly 237,000,001 bytes in size. I basically used vim to repeat your 32 byte example string 7,406,250 times, totalling 237,000,000 bytes, and the extra byte comes from the LF appended by vim. I named my test file x, and you can see I load it with data.table::fread(), as read.table() was taking way too long.
  • I made a minor change to my NA-padding algorithm. Instead of using out-of-bounds indexing, I realized we can assign the lengths of the vectors to the max length, leveraging right-to-left associativity for the assignment operator. The advantage here is that we don't have to construct the index vector 1:outmax anymore.
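A minimal sketch of that `length<-` padding trick, using the match positions from the small example string:

```r
# Assigning a larger length to a vector pads the new positions with NA
a <- c(1, 13, 29)          # 'ABCD' starts
b <- c(5, 9, 17, 21, 25)   # 'AAAD'/'ABBD'/'ACCD' starts
length(a) <- length(b) <- max(length(a), length(b))
cbind(a, b)
##       a  b
## [1,]  1  5
## [2,] 13  9
## [3,] 29 17
## [4,] NA 21
## [5,] NA 25
```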

Hence:

library('data.table');
library('stringi');
x <- fread('x',header=F)$V1;
## Read 1 rows and 1 (of 1) columns from 0.221 GB file in 00:00:03
system.time({ out1 <- stri_locate_all(x,regex='ABCD')[[1]][,'start']; });
##    user  system elapsed
##   3.687   0.359   4.044
system.time({ out2 <- stri_locate_all(x,regex='AAAD|ABBD|ACCD')[[1]][,'start']; });
##    user  system elapsed
##   4.938   0.454   5.404
length(out1);
## [1] 22218750
length(out2);
## [1] 37031250
length(out1) <- length(out2) <- max(length(out1),length(out2));
comb_mat <- cbind(out1,out2);
head(comb_mat);
##      out1 out2
## [1,]    1    5
## [2,]   13    9
## [3,]   29   17
## [4,]   33   21
## [5,]   45   25
## [6,]   61   37
tail(comb_mat);
##             out1      out2
## [37031245,]   NA 236999961
## [37031246,]   NA 236999973
## [37031247,]   NA 236999977
## [37031248,]   NA 236999985
## [37031249,]   NA 236999989
## [37031250,]   NA 236999993
nrow(comb_mat);
## [1] 37031250
bgoldst
  • What does fixed=T do there? – alki Jul 04 '15 at 02:54
  • When I run this code, it seems to hang at defining out1. – alki Jul 04 '15 at 03:07
  • That's very strange. There's nothing in that line that should cause R to hang. When it hangs, are you running it on the example `x` string (which is only 32 characters) or on a much larger string? – bgoldst Jul 04 '15 at 03:09
  • On a much larger string, 237 MB – alki Jul 04 '15 at 03:10
  • That's a massive string! I'm trying to load a 237MB string into R, but it's taking forever just to load. Have you been able to successfully load and manipulate such a string previously? – bgoldst Jul 04 '15 at 03:20
  • Using fread, I was able to read the .fasta file (essentially a txt file) in only 3 seconds. Here is the .fasta file: https://www.dropbox.com/s/ionrh0s99j47ggq/HG19C01_v.zip?dl=0 – alki Jul 04 '15 at 03:24
  • You're right, I just switched from using `read.table()` to using `fread()` and was able to load my 237,000,001 byte file in seconds. And you're right, the `gregexpr()` call does seem to be taking forever. I'm surprised at the poor performance here. I just wrote a simple C++ program that can scan for all matches of a four-byte search string in the massive string in seconds, including incrementally building up a variable-length vector with the indexes of the matches. There must be something wrong with the implementation of `gregexpr()`. We should look for alternative string matching R functions. – bgoldst Jul 04 '15 at 03:35
  • Aha! I think perhaps `stringi::stri_locate_all()` might be perfect for us here. I'm looking into a solution using that now. See http://cran.r-project.org/web/packages/stringi/stringi.pdf#page=74. (Discovered `stringi` from http://stackoverflow.com/questions/24257850/fast-partial-string-matching-in-r.) – bgoldst Jul 04 '15 at 03:49
  • Yes, it does! This was a great question to work on, I think we both learnt a lot! – bgoldst Jul 04 '15 at 04:12
  • great answer! SO at its best. – MichaelChirico Jul 04 '15 at 04:51

You can simplify this with a lookahead, so you only need one regex with two captured components.

ms <- gregexpr("A(?=(BCD)|(BBD|CCD|AAD))", x, perl=T)
res <- attr(ms[[1]], "capture.start")
res[res>0] <- res[res>0]-1

In this matrix, res, the first column is starting positions of ABCD, the second of the other three. If you want you can replace the zeros with NA.

# [1,]  1  0
# [2,]  0  5
# [3,]  0  9
# [4,] 13  0
# [5,]  0 17
# [6,]  0 21
# [7,]  0 25
# [8,] 29  0
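One way to do that last substitution, rebuilding `res` from the code above so the snippet is self-contained:

```r
# Rebuild res as above, then turn the 0 placeholders (group did not match) into NA
x <- "ABCDACCDABBDABCDAAADACCDABBDABCD"
ms <- gregexpr("A(?=(BCD)|(BBD|CCD|AAD))", x, perl = TRUE)
res <- attr(ms[[1]], "capture.start")
res[res > 0] <- res[res > 0] - 1   # shift group starts back to the leading 'A'
res[res == 0] <- NA                # 0 marks the alternative that did not match
```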
Rorschach

Another approach using the stringi package:

library(stringi)

x <- 'ABCDACCDABBDABCDAAADACCDABBDABCD'

m <- stri_locate_all_regex(x, c('ABCD', 'AAAD|ABBD|ACCD'))

l <- list(m[[1]][,'start'], m[[2]][,'start'])
do.call(cbind, lapply(l, `[`, seq_len(max(sapply(l, length)))))

#      [,1] [,2]
# [1,]    1    5
# [2,]   13    9
# [3,]   29   17
# [4,]   NA   21
# [5,]   NA   25

Or you can try the zoo package, whose cbind() method NA-pads series of unequal length (zoo must be loaded first):

library(zoo)

m <- coredata(do.call(cbind, lapply(l, zoo)))
colnames(m) <- NULL

#      [,1] [,2]
# [1,]    1    5
# [2,]   13    9
# [3,]   29   17
# [4,]   NA   21
# [5,]   NA   25
hwnd