EDITED Entirely revised for a third time to better address the question! Incidentally, the key solution here (in the form of three helper functions) does not require the Biostrings
package.
As I understand the problem, a short DNA query sequence is to be matched against a large number of reference DNA sequences. The twist here is that an arbitrary number of variations in the form of insertions or deletions on the DNA query sequence are to be searched for in the reference DNA sequences.
The function vmatchPattern()
from the Biostrings
package can identify matches of a given pattern with an arbitrary number of mismatches in a set of reference sequences. In addition, vmatchPattern()
can identify matches of a given pattern with possible insertions or deletions (indel). However, unlike matchPattern()
, vmatchPattern()
cannot do both at the same time.
The solution sought here is to generate generate variations of the DNA query sequence that can then be passed to a search function such as grepl()
or as suggested here, vmatchPattern()
.
The revised solution posted here includes three functions. makeDel()
will generate all possible variants of a short sequence with an arbitrary number of deletions. The companion function, makeIns()
will generate variants of the short sequence with the insertion specified as the IUPAC symbol in symbol
. makeSub()
will make the desired substitutions using the bases specified by the IUPAC code in symbol
. This approach, generating all possible combinations of other bases, allows the character strings to be directly used in pattern-matching functions including vmatchPaterrn
.
If it is going to be used, this ensures that the package Biostrings
is available. This code applies to versions of R at 3.60 and beyond.
if (!require("Biostrings")) {
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("Biostrings")
}
library(Biostrings)
Now some test data. The original query sequence "GGCGACTG"
will be used
as the "query" and 1000 random sequences between 200 and 400 nucleotides will be used as the reference set.
seq1 <- "GGCGACTG"
set.seed(1234)
ref <- replicate(1e3,
sample(c("A", "C", "G", "T"), sample(200:400, 1), replace = TRUE),
simplify = FALSE)
ref <- sapply(ref, paste, collapse = "")
ref <- DNAStringSet(ref)
Before proceeding with the solution, let's peek at what can be found with the pattern alone.
# how times does the pattern occur?
table(vcountPattern(seq1, ref)) # just 3 times
> 0 1
> 997 3
# how many times allowing for one mismatch?
# once in 96 sequences and twice in three sequences
n <- vcountPattern(seq1, ref, max.mismatch = 1)
table(n)
> n
> 0 1 2
> 901 96 3
# examine the matched sequences
m <- vmatchPattern(seq1, ref, max.mismatch = 1) # find the patterns
sel <- which.max(n) # select the first one with 2 matches
Views(ref[[sel]], m[[sel]]) # examine the matches
> Views on a 384-letter DNAString subject
> subject: TCGCGTCGCACTTCTGCTAACACAGC...GCCCAGTCGACTGCTGCTCGGATTGC
> views:
> start end width
> [1] 104 111 8 [GGCGACCG]
> [2] 364 371 8 [GTCGACTG]
Here are the three helper functions to generate the variants. The argument seq
can be a character string such as "GGGCGACTG" or a DNAString
object. The argument n
is an integer that specifies the upper limit on deletions, insertions, or substitutions. These functions will create variants with 0, 1, ..., n deletions, insertions or substitutions. If n
is set to 0, the function will return the original sequence. The argument symbol
for makeIns()
and makeSub()
should be a single IUPAC character to specify which bases will be inserted or substituted. The default value of "N" specifies all possible bases ("A", "C", "G" and "T").
makeDel()
use combn()
to identify the deletion positions. The logic for makeIns()
and makeSub()
is a bit more complex because of the need to allow insertions to be adjacent to each other and the need to create all combinations. Here I chose not to add insertions at the beginning or end of the query sequence.
All functions return a character vector suitable for use in vmatchPattern()
or grep()
.
To create deletions in a DNA string:
##
## makeDel - create 1:n deletions in a character string (DNA sequence)
## return a character vector of all possible variants
##
makeDel <- function(seq, n) {
# accept only a single value for 'seq'
cseq <- as.character(seq)
cseq <- unlist(strsplit(cseq[1], ""))
nseq <- length(cseq)
# simple argument checks
if (!is(n, "numeric")) stop("'n' must be an integer")
if (n == 0) return(paste(cseq, collapse = ""))
if (n >= nseq) stop("too many deletions for ", nseq, " letters")
# create all possible combinations to be dropped in 'index'
index <- lapply(seq_len(n), function(j) combn(nseq, j, simplify = FALSE))
index <- unlist(index, recursive = FALSE)
# drop base in each possible position and reassemble
ans <- lapply(index, function(idx) cseq[-idx])
ans <- sapply(ans, paste, collapse = "")
ans <- unique(ans) # remove duplicates
return(ans)
}
To create insertions in a DNA string:
##
## makeIns - create 1:n insertions into DNA string (character vector)
## where each insertion is one of a given IUPAC-specified symbol
## return a character vector of all possible variants
##
makeIns <- function(seq, n, symbol = "N") {
# IUPAC codes for ambiguous bases
iupac <- c(N = "ACGT", A = "A", C = "C", G = "G", T = "T", M = "AC", R = "AG",
W = "AT", S = "CG", Y = "CT", K = "GT", V = "ACG", H = "ACT",
D = "AGT", B = "CGT")
# only accept single value for 'seq'
cseq <- as.character(seq)
cseq <- unlist(strsplit(cseq[1], ""))
nseq <- length(cseq)
# simple argument checks
if (!is(n, "numeric")) stop("'n' must be an integer")
symbol <- toupper(symbol)
if (nchar(symbol) != 1 | !symbol %in% names(iupac))
stop("'symbol' must be a single valid IUPAC symbol")
if (n == 0) return(paste(cseq, collapse = ""))
if (n >= nseq) stop("seems like too many insertions for ", nseq, " letters")
# which bases are to be inserted?
ACGT <- strsplit(iupac[symbol], "")[[1]]
# create all possible combinations of positions for the insertion
ipos <- seq_len(nseq - 1) # insert after this position
index <- lapply(1:n, function(i) do.call(expand.grid, rep(list(ipos), i)))
index <- lapply(index, function(v) split(v, seq_len(nrow(v))))
index <- unlist(index, recursive = FALSE)
index <- lapply(index, unlist)
index <- lapply(index, sort)
# place the required number of insertions after each position in index
res <- lapply(index, function(idx) {
tally <- rle(idx)$lengths
breaks <- diff(c(0, idx, nseq))
NN <- Map(base::rep, symbol, tally)
spl <- split(cseq, rep(seq_along(breaks), breaks))
sel <- head(seq_along(spl), -1)
spl[sel] <- Map(base::c, spl[sel], NN)
ans <- unlist(spl)
if (length(ACGT) > 1) { # replicate and replace with appropriate bases
sites <- grep(symbol, ans)
nsites <- length(sites)
nsymbol <- length(ACGT)
bases <- expand.grid(rep(list(ACGT), nsites), stringsAsFactors = FALSE)
bases <- as.matrix(bases)
nvars <- nrow(bases)
ans <- do.call(rbind, rep(list(ans), nvars))
ans[, sites] <- bases
ans <- split(ans, seq_len(nvars))
ans <- lapply(ans, paste, collapse = "")
}
else
ans <- paste(ans, collapse = "")
return(ans)
})
res <- unlist(res)
res <- unique(res)
return(res)
}
To create substitutions in DNA string:
##
## makeSub - create an arbitrary number of substitutions in each 1:n positions
## with the IUPAC bases specified by 'symbol'
## return a character vector with all possible variants
##
makeSub <- function(seq, n, symbol = "N")
{
# IUPAC codes for ambiguous bases
iupac <- c(N = "ACGT", A = "A", C = "C", G = "G", T = "T", M = "AC", R = "AG",
W = "AT", S = "CG", Y = "CT", K = "GT", V = "ACG", H = "ACT",
D = "AGT", B = "CGT")
# accept only a single value for 'seq'
cseq <- as.character(seq)
cseq <- unlist(strsplit(cseq[1], ""))
nseq <- length(cseq)
# simple argument checks
if (!is(n, "numeric")) stop("'n' must be an integer")
symbol <- toupper(symbol)
if (nchar(symbol) != 1 | !symbol %in% names(iupac))
stop("'symbol' must be a single valid IUPAC symbol")
if (n == 0) return(paste(cseq, collapse = ""))
if (n > nseq) stop("too many substitutions for ", nseq, " bases")
# which bases are to be used for the substitution?
ACGT <- strsplit(iupac[symbol], "")[[1]]
# create all possible combinations of positions to be changed in 'index'
index <- lapply(seq_len(n), function(j) combn(nseq, j, simplify = FALSE))
index <- unlist(index, recursive = FALSE)
# for each numeric vector in index, create as many variants as
# alternative bases are needed, collect in 'ans'
ans <- lapply(index, function(idx) {
bases <- lapply(cseq[idx], function(v) setdiff(ACGT, v))
bases <- bases[sapply(bases, length) > 0] # defensive
bases <- expand.grid(bases, stringsAsFactors = FALSE)
bases <- as.matrix(bases)
nvars <- nrow(bases)
vars <- do.call(rbind, rep(list(cseq), nvars))
vars[ ,idx] <- bases
if (!is.null(vars))
return(split(vars, seq_len(nvars)))
})
ans <- unlist(ans, recursive = FALSE)
ans <- sapply(ans, paste, collapse = "")
ans <- unique(ans) # remove duplicates
return(ans)
}
Examples of the output:
makeDel(seq1, 0)
> [1] "GGCGACTG"
makeDel(seq1, 1)
> [1] "GCGACTG" "GGGACTG" "GGCACTG" "GGCGCTG" "GGCGATG" "GGCGACG" "GGCGACT"
makeDel(seq1, 2)
> [1] "GCGACTG" "GGGACTG" "GGCACTG" "GGCGCTG" "GGCGATG" "GGCGACG" "GGCGACT"
> [8] "CGACTG" "GGACTG" "GCACTG" "GCGCTG" "GCGATG" "GCGACG" "GCGACT"
> [15] "GGGCTG" "GGGATG" "GGGACG" "GGGACT" "GGCCTG" "GGCATG" "GGCACG"
> [22] "GGCACT" "GGCGTG" "GGCGCG" "GGCGCT" "GGCGAG" "GGCGAT" "GGCGAC"
makeIns(seq1, 1) # default form
> [1] "GAGCGACTG" "GCGCGACTG" "GGGCGACTG" "GTGCGACTG" "GGACGACTG" "GGCCGACTG"
> [7] "GGTCGACTG" "GGCAGACTG" "GGCGGACTG" "GGCTGACTG" "GGCGAACTG" "GGCGCACTG"
> [13] "GGCGTACTG" "GGCGACCTG" "GGCGAGCTG" "GGCGATCTG" "GGCGACATG" "GGCGACGTG"
> [19] "GGCGACTTG" "GGCGACTAG" "GGCGACTCG" "GGCGACTGG"
makeIns(seq1, 1, symbol = "Y") # inserting only "C" or "T"
> [1] "GCGCGACTG" "GTGCGACTG" "GGCCGACTG" "GGTCGACTG" "GGCTGACTG" "GGCGCACTG"
> [7] "GGCGTACTG" "GGCGACCTG" "GGCGATCTG" "GGCGACTTG" "GGCGACTCG"
makeSub("AAA", 1)
> [1] "CAA" "GAA" "TAA" "ACA" "AGA" "ATA" "AAC" "AAG" "AAT"
makeSub("AAA", 2)
> [1] "CAA" "GAA" "TAA" "ACA" "AGA" "ATA" "AAC" "AAG" "AAT" "CCA" "GCA" "TCA"
> [13] "CGA" "GGA" "TGA" "CTA" "GTA" "TTA" "CAC" "GAC" "TAC" "CAG" "GAG" "TAG"
> [25] "CAT" "GAT" "TAT" "ACC" "AGC" "ATC" "ACG" "AGG" "ATG" "ACT" "AGT" "ATT"
These functions can be used together with vmatchPattern()
to create variants and extract matches. One suggested approach would be to first find those sequences with mismatches using max.mismatch = 1
. Next, find sequences with deletions and with insertions using vmatchPattern()
with fixed = FALSE
and the default value of 0 for max.mismatch
.
Alternatively, the explicit patterns generated by the helper functions can be passed to grep
processes running in parallel! What follows shows the use of vmatchPattern
but there may be reasons to perform the analysis with different tools. See the comments on this topic.
# first, allow mismatches to the original pattern
# the result is a "ByPos_MIndex" object of length 1000
m1 <- vmatchPattern(seq1, ref, max.mismatch = 1) # as before...
n1 <- elementNROWS(m1) # counts the number of elements (matches)
which(n1 > 0) # which of the 1000 ref sequence had a match with 0 or 1 mismatches?
> [1] 14 71 73 76 79 88 90 108 126 129 138 141 150 160 163 179 180 195 200
> [20] 205 212 225 227 239 241 246 247 255 276 277 280 299 310 335 338 345 347 357
> [39] 359 369 378 383 387 390 391 404 409 410 414 418 469 472 479 488 499 509 523
> [58] 531 533 567 571 574 580 588 590 591 594 601 634 636 646 654 667 679 685 694
> [77] 696 713 717 732 734 737 749 750 761 762 783 815 853 854 857 903 929 943 959
> [96] 969 981 986 998
# Second search each of the patterns with lapply
# generates seven lists of objects, each of length 10000
pat2 <- makeDel(seq1, 1)
m2 <- lapply(pat2, function(pat) vmatchPattern(pat, ref))
# generates 22 lists of objects, each of length 10000
pat3 <- makeIns(seq1, 1)
m3 <- lapply(pat3, function(pat) vmatchPattern(pat, ref))
The second and third results in m2
and m3
are lists of "ByPos_MIndex" objects. The example below extracts the number of matches from m2
and shows these matches in an abbreviated form with str()
. Each value in a list identifies the reference sequence that had at least one match with respective pattern.
n2 <- lapply(m2, elementNROWS)
str(sapply(n2, function(n) which(n > 0)))
> List of 7
> $ : int [1:14] 14 138 179 335 369 391 567 679 713 734 ...
> $ : int [1:18] 138 200 240 298 310 343 510 594 598 599 ...
> $ : int [1:15] 21 26 45 60 260 497 541 600 607 642 ...
> $ : int [1:17] 27 54 120 121 123 132 210 242 244 257 ...
> $ : int [1:18] 15 33 110 126 154 419 528 539 546 606 ...
> $ : int [1:12] 24 77 79 139 525 588 601 679 770 850 ...
> $ : int [1:15] 179 345 378 414 469 571 574 580 591 713 ...
This final example examines the third list of 22 "ByPos_MIndex" objects (m3
) by the same mechanism. It shows that some of the 22 variants fail to match, some match once and five match twice.
n3 <- lapply(m3, elementNROWS) # extract all counts
names(n3) <- sprintf("pat_%02d", seq_along(n3)) # for convenience
str(lapply(n3, function(n) which(n > 0)))
> List of 22
> $ pat_01: int 679
> $ pat_02: int 391
> $ pat_03: int(0)
> $ pat_04: int 737
> $ pat_05: int(0)
> $ pat_06: int(0)
> $ pat_07: int 108
> $ pat_08: int 276
> $ pat_09: int 439
> $ pat_10: int [1:2] 764 773
> $ pat_11: int(0)
> $ pat_12: int [1:2] 22 820
> $ pat_13: int 795
> $ pat_14: int [1:2] 914 981
> $ pat_15: int(0)
> $ pat_16: int 112
> $ pat_17: int 884
> $ pat_18: int(0)
> $ pat_19: int [1:2] 345 378
> $ pat_20: int [1:2] 571 854
> $ pat_21: int 574
> $ pat_22: int(0)
Needless to say, a lot of data wrangling remains in order to extract sequence information. This can be coded with the help pages for matchPattern
and with some understanding of the pattern matching logic described in help("lowlevel-matching", package = "Biostrings")
.
Although the routines in Biostrings
use very fast and very memory-efficient algorithms for handling large sequences. Joe seems to find raw searching faster under other circumstances. There's always more to learn!