4

I have a DNA sequence, and I want to find all instances of it, or any of its possible mutations in a list of DNA sequence reads. I am using grepl to do this, since it is faster than matchPattern in the instance I am using it. I use parLapply to feed my vector of mutations to the grepl function. But what I am interested in doing is making an easy way of auto-generating my vector of sequence mutations. Originally I typed each mutation, but that leaves room for human error, and if the sequence is lengthened, more mutations would need to be typed. In addition, my current code only allows 1 mutation, and some sequences should allow for more mutations than others. I am not looking for someone to write a loop for me, but just give me a suggestion for accounting for any string.

Right now, I have a semi-automated way of generating the mutations. It now generates the vector without me typing them all out, but only works for 8 nucleotide long sequences. There has to be a better way to generate the vector for any nucleotide sequence of any length.

This is my code:

#My sequence of interest
seq1 <- "GGCGACTG"
lenseq1 <- nchar(seq1)

#A vector of the length of the sequence I wish to create all mutations of
mutsinseq1 <- rep(seq1, 5*lenseq1+4*(lenseq1-1)+1)

#The possible substitutions, insertions, and deletions to the sequence of interest
possnuc <- c("A","T","C","G","")
lenpossnuc <- length(possnuc)

#changing all elements of the vector except for the first
#the first 8 if statements are nucleotide substitutions or deletions
#the other if statements allow for inserts between nucleotides
for(i in 2:length(mutsinseq1)){
  if(i<7){
    mutsinseq1[i] <- paste(possnuc[i-1],substr(seq1,2,lenseq1),sep = "") 
  } else if(i<12){
    mutsinseq1[i] <- paste(substr(seq1,1,1),possnuc[i-6],substr(seq1,3,lenseq1),sep = "")
  } else if(i<17){
    mutsinseq1[i] <- paste(substr(seq1,1,2),possnuc[i-11],substr(seq1,4,lenseq1),sep = "")
  } else if(i<22){
    mutsinseq1[i] <- paste(substr(seq1,1,3),possnuc[i-16],substr(seq1,5,lenseq1),sep = "")
  } else if(i<27){
    mutsinseq1[i] <- paste(substr(seq1,1,4),possnuc[i-21],substr(seq1,6,lenseq1),sep = "")
  } else if(i<32){
    mutsinseq1[i] <- paste(substr(seq1,1,5),possnuc[i-26],substr(seq1,7,lenseq1),sep = "")
  } else if(i<37){
    mutsinseq1[i] <- paste(substr(seq1,1,6),possnuc[i-31],substr(seq1,8,lenseq1),sep = "")
  } else if(i<42){
    mutsinseq1[i] <- paste(substr(seq1,1,7),possnuc[i-36],sep = "")
  } else if(i<46){
    mutsinseq1[i] <- paste(substr(seq1,1,1),possnuc[i-41],substr(seq1,2,lenseq1),sep = "")
  } else if(i<50){
    mutsinseq1[i] <- paste(substr(seq1,1,2),possnuc[i-45],substr(seq1,3,lenseq1),sep = "")
  } else if(i<54){
    mutsinseq1[i] <- paste(substr(seq1,1,3),possnuc[i-49],substr(seq1,4,lenseq1),sep = "")
  } else if(i<58){
    mutsinseq1[i] <- paste(substr(seq1,1,4),possnuc[i-53],substr(seq1,5,lenseq1),sep = "")
  } else if(i<62){
    mutsinseq1[i] <- paste(substr(seq1,1,5),possnuc[i-57],substr(seq1,6,lenseq1),sep = "")
  } else if(i<66){
    mutsinseq1[i] <- paste(substr(seq1,1,6),possnuc[i-61],substr(seq1,7,lenseq1),sep = "")
  } else{
    mutsinseq1[i] <- paste(substr(seq1,1,7),possnuc[i-65],substr(seq1,8,lenseq1),sep = "")
  }
}

#getting rid of duplicate mutations
mutsinseq1 <- mutsinseq1[-which(duplicated(mutsinseq1))]

The following is what I wish to produce (and is produced by my current code):

mutsinseq1
[1] "GGCGACTG"  "AGCGACTG"  "TGCGACTG"  "CGCGACTG"  "GCGACTG"   "GACGACTG"  "GTCGACTG"  "GCCGACTG"  "GGAGACTG"  "GGTGACTG"  "GGGGACTG"  "GGGACTG"   "GGCAACTG" 
[14] "GGCTACTG"  "GGCCACTG"  "GGCACTG"   "GGCGTCTG"  "GGCGCCTG"  "GGCGGCTG"  "GGCGCTG"   "GGCGAATG"  "GGCGATTG"  "GGCGAGTG"  "GGCGATG"   "GGCGACAG"  "GGCGACCG" 
[27] "GGCGACGG"  "GGCGACG"   "GGCGACTA"  "GGCGACTT"  "GGCGACTC"  "GGCGACT"   "GAGCGACTG" "GTGCGACTG" "GCGCGACTG" "GGGCGACTG" "GGACGACTG" "GGTCGACTG" "GGCCGACTG"
[40] "GGCAGACTG" "GGCTGACTG" "GGCGGACTG" "GGCGAACTG" "GGCGTACTG" "GGCGCACTG" "GGCGATCTG" "GGCGACCTG" "GGCGAGCTG" "GGCGACATG" "GGCGACTTG" "GGCGACGTG" "GGCGACTAG"
[53] "GGCGACTCG" "GGCGACTGG"

How do I solve the problem?

Emma
  • 27,428
  • 11
  • 44
  • 69
Joe Kowzun
  • 45
  • 8
  • Joe, I posted an answer that missed the mark. I'll replace my answer with one that I think *properly* addresses your question! When you say, "DNA sequence reads", I going to assume these are fairly short reads such as from an Illumina platform. – David O Jul 28 '19 at 01:40
  • Is this just an exercise or are you going to use it in a real project? In the former case, are you open to alternative algorithms (because yours is quite naïve). In the latter case, you might want to consider tools, such as BWA and blastn for the task. – Eli Korvigo Jul 28 '19 at 16:40
  • @David O, I am analyzing 100 nucleotide long Illumina sequences, but they are of a highly mutative virus, so the sequences I am searching for have a good chance of being mutated themselves (hence why I need to generate a list of possible mutations). – Joe Kowzun Jul 28 '19 at 19:02
  • @EliKorvigo I am aware my code is naive. I've only been coding for a month, so I don't have much experience. The problem with blast, is that I have millions of sequences to look in. In addition to this project, I am working on code to BLAST my sequences of interest. – Joe Kowzun Jul 28 '19 at 19:07
  • 2
    I appreciate what you are facing! My suggestions about taking advantage of vmatchPattern are very pertinent. Your search for a way to pass variant sequences is something that the maintainers indicate will come some time...In the meanwhile, your thoughts and my solution may be applicable. I do agree that with Eli that using grepl isn't really the strongest approach. We could continue in a chat room if you like. – David O Jul 28 '19 at 19:11
  • Sadly, because I only have 13 points on StackOverflow, I haven't yet unlocked the chat privilege. – Joe Kowzun Jul 28 '19 at 20:12
  • 1
    "The problem with blast, is that I have millions of sequences to look in" – which is why I've mentioned BWA as another potential tool to use. – Eli Korvigo Jul 28 '19 at 21:48
  • 2
    Hi Joe. You've convinced me! This is very similar to a problem that frustrated me before (with the related `matchPdict` functions). It's been fun revisiting this and I hope the *newest* answer of my provides useful functions. They are a bit large and poorly documented--so don't hesitate to complain. – David O Jul 29 '19 at 02:34

2 Answers2

3

In other languages, you might do this with a series of nested loops, but in R, there's some nice combinatorics functions. Here's the overall function to do what you want:

library(stringr)
library(purrr)
library(dplyr)

mutate_sequence <- function(string, num = 1, nucleotides = c("A","T","C","G","_")) {
  l_str <- str_length(string)

  choices <- cross(list(
    cols = combn(seq_len(l_str), num, simplify = F),
    muts = cross(rerun(num, nucleotides)) %>% map(unlist)
  ))

  choice_matrix <- 
    map_dfr(choices, as_tibble, .id = "rows") %>% 
    mutate(rows = as.numeric(rows))

  seq_matrix <- str_split(rep(string, max(choice_matrix$rows)), "", simplify = T)

  seq_matrix[as.matrix(choice_matrix[,1:2])] <- str_to_lower(choice_matrix$muts)
  apply(seq_matrix, 1, paste, collapse = "")
}

I used some packages to make things a little easier on me, but it could all be translated into base R.

Here's example output:

mutate_sequence("ATCG", num = 2)
  [1] "aaCG" "aTaG" "aTCa" "AaaG" "AaCa" "ATaa" "taCG" "tTaG" "tTCa" "AtaG" "AtCa" "ATta" "caCG" "cTaG"
 [15] "cTCa" "AcaG" "AcCa" "ATca" "gaCG" "gTaG" "gTCa" "AgaG" "AgCa" "ATga" "_aCG" "_TaG" "_TCa" "A_aG"
 [29] "A_Ca" "AT_a" "atCG" "aTtG" "aTCt" "AatG" "AaCt" "ATat" "ttCG" "tTtG" "tTCt" "AttG" "AtCt" "ATtt"
 [43] "ctCG" "cTtG" "cTCt" "ActG" "AcCt" "ATct" "gtCG" "gTtG" "gTCt" "AgtG" "AgCt" "ATgt" "_tCG" "_TtG"
 [57] "_TCt" "A_tG" "A_Ct" "AT_t" "acCG" "aTcG" "aTCc" "AacG" "AaCc" "ATac" "tcCG" "tTcG" "tTCc" "AtcG"
 [71] "AtCc" "ATtc" "ccCG" "cTcG" "cTCc" "AccG" "AcCc" "ATcc" "gcCG" "gTcG" "gTCc" "AgcG" "AgCc" "ATgc"
 [85] "_cCG" "_TcG" "_TCc" "A_cG" "A_Cc" "AT_c" "agCG" "aTgG" "aTCg" "AagG" "AaCg" "ATag" "tgCG" "tTgG"
 [99] "tTCg" "AtgG" "AtCg" "ATtg" "cgCG" "cTgG" "cTCg" "AcgG" "AcCg" "ATcg" "ggCG" "gTgG" "gTCg" "AggG"
[113] "AgCg" "ATgg" "_gCG" "_TgG" "_TCg" "A_gG" "A_Cg" "AT_g" "a_CG" "aT_G" "aTC_" "Aa_G" "AaC_" "ATa_"
[127] "t_CG" "tT_G" "tTC_" "At_G" "AtC_" "ATt_" "c_CG" "cT_G" "cTC_" "Ac_G" "AcC_" "ATc_" "g_CG" "gT_G"
[141] "gTC_" "Ag_G" "AgC_" "ATg_" "__CG" "_T_G" "_TC_" "A__G" "A_C_" "AT__"

I made the mutations lowercase or "_" to make it obvious where they are, but you can easily change that to get them back to "normal" sequences.

So each line does some things:

l_str <- str_length(string)

Gets the number of characters in the string.

combn(seq_len(l_str), num, simplify = F)

1) This is all possible combinations of positions along the sequence (indexes), taken num at a time, for the number of mutations.

rerun(num, nucleotides)

2) This repeats your vector of nucleotides num times, and makes it a list. cross(rerun(num, nucleotides)) then gives you every combination from that list as a list, so you're taking every possible combination of nucleotides, with repeats. cross(rerun(num, nucleotides)) %>% map(unlist) collapses the deepest level of the list into vectors.

So those last two chunks give you every possible choice of positions, and then every possible combination of replacements. Then we need every possible combination of those as pairs!

  choices <- cross(list(
    cols = combn(seq_len(l_str), num, simplify = F),
    muts = cross(rerun(num, nucleotides)) %>% map(unlist)
  ))

For the above output, that means:

[[1]]
[[1]]$`cols`
[1] 1 2

[[1]]$muts
[1] "A" "A"


[[2]]
[[2]]$`cols`
[1] 1 2

[[2]]$muts
[1] "T" "A"
...

So first for positions 1/2, it gives us A/A, T/A, G/A, C/A, _/A, etc. Then each combination again for positions 1/3, then positions 1/4, then 2/3, then 2/4, then 3/4.

So now you have this long list, and let's make it into something nicer. First we make each element into a dataframe with cols and muts, then bind them all into a single one with an identifier for each element called rows:

map_dfr(choices, as_tibble, .id = "rows")
# A tibble: 50 x 3
   rows   cols muts 
   <chr> <int> <chr>
 1 1         1 A    
 2 1         2 A    
 3 2         1 T    
 4 2         2 A    
 5 3         1 C    
 6 3         2 A    
 7 4         1 G    
 8 4         2 A    
 9 5         1 _    
10 5         2 A
# ... with 40 more rows

This gives us a long dataframe. Each of rows is one output string, and the cols tells us which position in the string will be replaces. muts is the characters that will go in those positions. In order to do the subsetting later, we'll then convert rows to numeric, using mutate(...).

seq_matrix <- str_split(rep(string, max(choice_matrix$rows)), "", simplify = T)

Now we take your original string and repeat it as many times as the choice_matrix tells us we'll have mutated sequences. Then we take that vector, and split every one along the character boundaries:

      [,1] [,2] [,3] [,4]
 [1,] "A"  "T"  "C"  "G" 
 [2,] "A"  "T"  "C"  "G" 
 [3,] "A"  "T"  "C"  "G" 
 [4,] "A"  "T"  "C"  "G" 
 [5,] "A"  "T"  "C"  "G" 
 [6,] "A"  "T"  "C"  "G" 
 ...

Now we have a big matrix, and R is fast at operating on these big matrices. We could have done all the other steps with matrix operations, but that seemed like more work than using this list-combination functions.

seq_matrix[as.matrix(choice_matrix[,1:2])] <- str_to_lower(choice_matrix$muts)

This identifies each position based on the rows and cols in the choice_matrix. Then it puts the appropriate value from muts in it. This is also where you can take out str_to_lower to keep them from being lowercase. You'd change the default argument of nucleotides to make the "_" into "".

       [,1] [,2] [,3] [,4]
  [1,] "a"  "a"  "C"  "G" 
  [2,] "a"  "T"  "a"  "G" 
  [3,] "a"  "T"  "C"  "a" 
  [4,] "A"  "a"  "a"  "G" 
  [5,] "A"  "a"  "C"  "a" 
  [6,] "A"  "T"  "a"  "a" 
  ...

So row 1 got "A" and "A" in positions 1 and 2. Then row 2 got "A" and "A" in positions 1 and 3, etc. Now we just have to apply across each row (that's what the 1 in apply(..., 1, ...) does) a function to combine each row into a single string. That would be paste(..., collapse = "").

This will get you huge output quickly. If you do 3 mutations on your original 8 nucleotide sequence, you get an output of 7000. 4 mutations is 43750. And each time gets that much slower, taking about 5s to run the 4 mutations on my desktop. You could precalculate the output length, which is choose(l_str, num) * length(nucleotides)^num.


Updated, again:

To handle insertions as well as deletions, we just need the character matrix to have a slot for every possible insertion. Here's that version:

mutate_sequence <- function(string, num = 1, nucleotides = c("A","T","C","G","")) {
  if (num < 1) {return(string)}

  l_str <- str_length(string)
  l_pos <- (num + 1)*(l_str - 1) + 1

  choices <- cross(list(
    cols = combn(seq_len(l_pos), num, simplify = F),
    muts = cross(rerun(num, nucleotides)) %>% map(unlist)
  ))

  choice_matrix <- 
    map_dfr(choices, as_data_frame, .id = "rows") %>% 
    mutate(rows = as.numeric(rows))

  blanks <- character(l_pos)
  orig_pos <- (seq_len(l_str) - 1) * (num+1) + 1
  blanks[orig_pos] <- str_split(string, "", simplify = T)

  seq_matrix <- matrix(
    rep(blanks, max(choice_matrix$rows)), 
    ncol = l_pos, byrow = T
    )

  seq_matrix[as.matrix(choice_matrix[,1:2])] <- str_to_lower(choice_matrix$muts)
  sequences <- apply(seq_matrix, 1, paste, collapse = "")
  sequences[!duplicated(str_to_upper(sequences))]
}

This does essentially the same as the version of the function above, but first you create a blank vector with enough spots for every insertion. For each original nucleotide, you need an additional spot to insert after it, except the last one. That works out to l_pos <- (num + 1)*(l_str - 1) + 1 positions. character(l_pos) gives you the blanks, and then you fill in the blanks with the original nucleotides at (seq_len(l_str) - 1) * (num+1) + 1.

For example, ATCG with two mutations becomes "A" "" "" "T" "" "" "C" "" "" "G". The rest of the function works the same, just putting every possible nucleotide (or deletion) in every possible spot.

The output before pasteing it all back together then looks like:

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] "a"  "a"  ""   "T"  ""   ""   "C"  ""   ""   "G"  
[2,] "a"  ""   "a"  "T"  ""   ""   "C"  ""   ""   "G"  
[3,] "a"  ""   ""   "a"  ""   ""   "C"  ""   ""   "G"  
[4,] "a"  ""   ""   "T"  "a"  ""   "C"  ""   ""   "G"  
[5,] "a"  ""   ""   "T"  ""   "a"  "C"  ""   ""   "G" 
...  

Then after pasteing each row, we can check for repeats with duplicated and exclude those. We could also just get rid of the lowercase mutations and use unique(sequences) instead. Now the output is much shorter than before, the first 55 of 278:

  [1] "aaTCG"  "aaCG"   "aTaCG"  "aTaG"   "aTCaG"  "aTCa"   "AaaTCG" "AaaCG"  "AaTaCG" "AaTaG"  "AaTCaG"
 [12] "AaTCa"  "AaaG"   "AaCaG"  "AaCa"   "ATaaCG" "ATaaG"  "ATaCaG" "ATaCa"  "ATaa"   "ATCaaG" "ATCaa" 
 [23] "taTCG"  "taCG"   "tTaCG"  "tTaG"   "tTCaG"  "tTCa"   "AtaTCG" "AtTaCG" "AtTaG"  "AtTCaG" "AtTCa" 
 [34] "ATta"   "ATCtaG" "ATCta"  "caTCG"  "caCG"   "cTaCG"  "cTaG"   "cTCaG"  "cTCa"   "AcaTCG" "AcaCG" 
 [45] "AcTaCG" "AcTaG"  "AcTCaG" "AcTCa"  "AcaG"   "AcCaG"  "AcCa"   "ATcaCG" "ATcCaG" "ATcCa"  "gaTCG"
...
Brian
  • 7,900
  • 1
  • 27
  • 41
  • 2
    No offense, I am grateful for the work you put into this, and the time you spent explaining this, but it is still missing the ability allow for insertions. Putting in all the two nucleotide possibilities into that nucleotide vector argument would make two mutations at once 3/4 of the time, while only being counted as one mutation, since the two character string is only replacing one string in the row in the seq_matrix. I don't even know how I would go about doing that with the method you have shown. – Joe Kowzun Jul 27 '19 at 14:36
  • Hmm, I didn't think of insertions, since you hadn't mentioned them in the Q. I'd have to think about how to do those in a matrix context, but it's certainly doable. It's also worth noting that sometimes the "mutations" with this method aren't changes at all, since A->A is a possible substitution. – Brian Jul 27 '19 at 16:20
  • I'm sorry. I should have stated that more clearly. In one of the comments in my sample code, I said that the latter 8 if statements allow for insertion mutations, and my vector in my question does include insertions (note that some strings are 9 characters instead of 8 or 7), but I should have been more explicit in my question. Since my background is in biology and genetics, I guess I assumed others would know the term mutations in this instance covers substitutions, insertions, and deletions. This isn't a biology forum, so I should have been more clear. – Joe Kowzun Jul 28 '19 at 19:47
  • 1
    @JoeKowzun I'm a recovering biologist, so I should have thought of that! I came up with a good way to handle insertions, if you like this approach. See my edits. – Brian Jul 29 '19 at 04:30
  • 1
    I'm just curious, why do you allow for insertions before the first nucleotide? In the case of pattern "ATCG", if "aa" gets inserted before "ATCG", the "ATCG" sequence goes undisturbed, and would still be found if you are pattern matching for "ATCG". The same is true for insertions after the end of the sequence. – Joe Kowzun Jul 29 '19 at 13:33
  • I'm sorry if I misrepresented what `makeIns` does. It does not place insertions at the beginning or end for the reasons you say. (If I didn't mess it up...) I think it's good. – David O Jul 29 '19 at 14:35
  • @JoeKowzun that's a good point I didn't think of, re: insertions wholly before or after the sequence. – Brian Jul 29 '19 at 17:13
  • 1
    @Brian I'm sorry, should have definitely stated before, but partial prior and post insertions have the same problem as wholly inserting prior or post sequence. Several of the "mutations" from ATCG is aTCG. aaTCG, caTCG, gaTCG, taTCG, aTCGa, aTCGt, aTCGc, and aTCGg should all be created. All of the matches you would find with the other eight mutations would be found with aTCG. – Joe Kowzun Jul 29 '19 at 18:57
  • @JoeKowzun hopefully my most recent edits addressed that. – Brian Jul 29 '19 at 19:12
  • I just checked with what I currently see as your most up to date version, and they did not. – Joe Kowzun Jul 29 '19 at 19:24
  • Ok, I think I know what it is. The A in ATCG gets substituted to A, then an A gets inserted between A and T. It looks identical to A getting inserted before ATCG, then having the original A get substituted to A. Same thing happens anytime an insertion exactly mimics the complete sequence that follows or precedes it, then the original sequence is mutated. (E.g. Mutating AT to GC then inserting AT after the T in the sequence ATCG looks exactly like inserting GC before ATCG, or CG is inserted, then the original CG gets mutated to TA). – Joe Kowzun Jul 29 '19 at 22:58
  • @DavidO and Brian, I think (I don't know for certain) if I use lapply to feed the result David O's makeSub function to a grepl function using his result as a pattern, and the result of Brian's function as a query to search in, then remove any pattern from your result that corresponds to a TRUE after the first TRUE in each element from the list made by lapply, I can get rid of the results from Brian's function that look identical insertions before or after a variant of the original sequence. – Joe Kowzun Jul 29 '19 at 23:23
  • At the risk of upsetting SO algorithms for too much talking *and* pushing biological credulity...one approach for making all possible permutations (using `magrittr` for convenience) is: `library(magrittr); ans <- c(makeIns(seq1, 1), makeDel(seq1, 1)) %>% lapply(makeSub, 1) %>% unlist %>% unique` gives you all 603 possibilities of that sequence with insertions, deletions and substitutions! Good luck again! It's been fun! – David O Jul 30 '19 at 01:43
2

EDITED Entirely revised for a third time to better address the question! Incidentally, the key solution here (in the form of three helper functions) does not require the Biostrings package.

As I understand the problem, a short DNA query sequence is to be matched against a large number of reference DNA sequences. The twist here is that an arbitrary number of variations in the form of insertions or deletions on the DNA query sequence are to be searched for in the reference DNA sequences.

The function vmatchPattern() from the Biostrings package can identify matches of a given pattern with an arbitrary number of mismatches in a set of reference sequences. In addition, vmatchPattern() can identify matches of a given pattern with possible insertions or deletions (indel). However, unlike matchPattern(), vmatchPattern() cannot do both at the same time.

The solution sought here is to generate generate variations of the DNA query sequence that can then be passed to a search function such as grepl() or as suggested here, vmatchPattern().

The revised solution posted here includes three functions. makeDel() will generate all possible variants of a short sequence with an arbitrary number of deletions. The companion function, makeIns() will generate variants of the short sequence with the insertion specified as the IUPAC symbol in symbol. makeSub() will make the desired substitutions using the bases specified by the IUPAC code in symbol. This approach, generating all possible combinations of other bases, allows the character strings to be directly used in pattern-matching functions including vmatchPaterrn.

If it is going to be used, this ensures that the package Biostrings is available. This code applies to versions of R at 3.60 and beyond.

  if (!require("Biostrings")) {
    if (!requireNamespace("BiocManager", quietly = TRUE))
        install.packages("BiocManager")
    BiocManager::install("Biostrings")
  }
  library(Biostrings)

Now some test data. The original query sequence "GGCGACTG" will be used as the "query" and 1000 random sequences between 200 and 400 nucleotides will be used as the reference set.

  seq1 <- "GGCGACTG"

  set.seed(1234)
  ref <- replicate(1e3,
    sample(c("A", "C", "G", "T"), sample(200:400, 1), replace = TRUE),
    simplify = FALSE)
  ref <- sapply(ref, paste, collapse = "")
  ref <- DNAStringSet(ref)

Before proceeding with the solution, let's peek at what can be found with the pattern alone.

# how times does the pattern occur? 
  table(vcountPattern(seq1, ref)) # just 3 times
>   0   1 
> 997   3 

# how many times allowing for one mismatch?
# once in 96 sequences and twice in three sequences
  n <- vcountPattern(seq1, ref, max.mismatch = 1)
  table(n)
> n
>   0   1   2 
> 901  96   3 

# examine the matched sequences
  m <- vmatchPattern(seq1, ref, max.mismatch = 1) # find the patterns
  sel <- which.max(n) # select the first one with 2 matches
  Views(ref[[sel]], m[[sel]]) # examine the matches
>   Views on a 384-letter DNAString subject
> subject: TCGCGTCGCACTTCTGCTAACACAGC...GCCCAGTCGACTGCTGCTCGGATTGC
> views:
>     start end width
> [1]   104 111     8 [GGCGACCG]
> [2]   364 371     8 [GTCGACTG] 

Here are the three helper functions to generate the variants. The argument seq can be a character string such as "GGGCGACTG" or a DNAString object. The argument n is an integer that specifies the upper limit on deletions, insertions, or substitutions. These functions will create variants with 0, 1, ..., n deletions, insertions or substitutions. If n is set to 0, the function will return the original sequence. The argument symbol for makeIns() and makeSub() should be a single IUPAC character to specify which bases will be inserted or substituted. The default value of "N" specifies all possible bases ("A", "C", "G" and "T").

makeDel() use combn() to identify the deletion positions. The logic for makeIns() and makeSub() is a bit more complex because of the need to allow insertions to be adjacent to each other and the need to create all combinations. Here I chose not to add insertions at the beginning or end of the query sequence.

All functions return a character vector suitable for use in vmatchPattern() or grep().

To create deletions in a DNA string:

  ##
  ## makeDel - create 1:n deletions in a character string (DNA sequence)
  ##  return a character vector of all possible variants
  ##
  makeDel <- function(seq, n) {
  # accept only a single value for 'seq'
    cseq <- as.character(seq)
    cseq <- unlist(strsplit(cseq[1], ""))
    nseq <- length(cseq)

  # simple argument checks
    if (!is(n, "numeric")) stop("'n' must be an integer")
    if (n == 0) return(paste(cseq, collapse = ""))
    if (n >= nseq) stop("too many deletions for ", nseq, " letters")

  # create all possible combinations to be dropped in 'index'
    index <- lapply(seq_len(n), function(j) combn(nseq, j, simplify = FALSE))
    index <- unlist(index, recursive = FALSE)

  # drop base in each possible position and reassemble
    ans <- lapply(index, function(idx) cseq[-idx])
    ans <- sapply(ans, paste, collapse = "")
    ans <- unique(ans) # remove duplicates
    return(ans)
  }

To create insertions in a DNA string:

  ##
  ## makeIns - create 1:n insertions into DNA string (character vector)
  ##   where each insertion is one of a given IUPAC-specified symbol
  ##   return a character vector of all possible variants
  ##
  makeIns <- function(seq, n, symbol = "N") {
  # IUPAC codes for ambiguous bases
    iupac <- c(N = "ACGT", A = "A", C = "C", G = "G", T = "T", M = "AC", R = "AG",
      W = "AT", S = "CG", Y = "CT", K = "GT", V = "ACG", H = "ACT",
      D = "AGT", B = "CGT")

 # only accept single value for 'seq'
    cseq <- as.character(seq)
    cseq <- unlist(strsplit(cseq[1], ""))
    nseq <- length(cseq)

 # simple argument checks
    if (!is(n, "numeric")) stop("'n' must be an integer")
    symbol <- toupper(symbol)
    if (nchar(symbol) != 1 | !symbol %in% names(iupac))
      stop("'symbol' must be a single valid IUPAC symbol")
    if (n == 0) return(paste(cseq, collapse = ""))
    if (n >= nseq) stop("seems like too many insertions for ", nseq, " letters")

  # which bases are to be inserted?
    ACGT <- strsplit(iupac[symbol], "")[[1]]

  # create all possible combinations of positions for the insertion 
    ipos <- seq_len(nseq - 1) # insert after this position
    index <- lapply(1:n, function(i) do.call(expand.grid, rep(list(ipos), i)))
    index <- lapply(index, function(v) split(v, seq_len(nrow(v))))
    index <- unlist(index, recursive = FALSE)
    index <- lapply(index, unlist)
    index <- lapply(index, sort)

  # place the required number of insertions after each position in index
    res <- lapply(index, function(idx) {
      tally <- rle(idx)$lengths
      breaks <- diff(c(0, idx, nseq))
      NN <- Map(base::rep, symbol, tally)
      spl <- split(cseq, rep(seq_along(breaks), breaks))
      sel <- head(seq_along(spl), -1)
      spl[sel] <- Map(base::c, spl[sel], NN)
      ans <- unlist(spl)
      if (length(ACGT) > 1) { # replicate and replace with appropriate bases
        sites <- grep(symbol, ans)
        nsites <- length(sites)
        nsymbol <- length(ACGT)

        bases <- expand.grid(rep(list(ACGT), nsites), stringsAsFactors = FALSE)
        bases <- as.matrix(bases)
        nvars <- nrow(bases)

        ans <- do.call(rbind, rep(list(ans), nvars))
        ans[, sites] <- bases
        ans <- split(ans, seq_len(nvars))
        ans <- lapply(ans, paste, collapse = "")
      }
      else
        ans <- paste(ans, collapse = "")
      return(ans)
    })
    res <- unlist(res)
    res <- unique(res)
    return(res)
  }

To create substitutions in DNA string:

  ##
  ## makeSub - create an arbitrary number of substitutions in each 1:n positions
  ##   with the IUPAC bases specified by 'symbol'
  ##   return a character vector with all possible variants
  ##
  makeSub <- function(seq, n, symbol = "N")
  {
  # IUPAC codes for ambiguous bases
    iupac <- c(N = "ACGT", A = "A", C = "C", G = "G", T = "T", M = "AC", R = "AG",
      W = "AT", S = "CG", Y = "CT", K = "GT", V = "ACG", H = "ACT",
      D = "AGT", B = "CGT")

  # accept only a single value for 'seq'
    cseq <- as.character(seq)
    cseq <- unlist(strsplit(cseq[1], ""))
    nseq <- length(cseq)

  # simple argument checks
    if (!is(n, "numeric")) stop("'n' must be an integer")
    symbol <- toupper(symbol)
    if (nchar(symbol) != 1 | !symbol %in% names(iupac))
      stop("'symbol' must be a single valid IUPAC symbol")
    if (n == 0) return(paste(cseq, collapse = ""))
    if (n > nseq) stop("too many substitutions for ", nseq, " bases")

  # which bases are to be used for the substitution?
    ACGT <- strsplit(iupac[symbol], "")[[1]]

  # create all possible combinations of positions to be changed in 'index'
    index <- lapply(seq_len(n), function(j) combn(nseq, j, simplify = FALSE))
    index <- unlist(index, recursive = FALSE)

  # for each numeric vector in index, create as many variants as
  # alternative bases are needed, collect in 'ans'
    ans <- lapply(index, function(idx) {
      bases <- lapply(cseq[idx], function(v) setdiff(ACGT, v))
      bases <- bases[sapply(bases, length) > 0] # defensive 
      bases <- expand.grid(bases, stringsAsFactors = FALSE)
      bases <- as.matrix(bases)
      nvars <- nrow(bases)

      vars <- do.call(rbind, rep(list(cseq), nvars))
      vars[ ,idx] <- bases
      if (!is.null(vars))
        return(split(vars, seq_len(nvars)))
    })
    ans <- unlist(ans, recursive = FALSE)
    ans <- sapply(ans, paste, collapse = "")
    ans <- unique(ans) # remove duplicates
    return(ans)
  }

Examples of the output:

  makeDel(seq1, 0)
> [1] "GGCGACTG"

  makeDel(seq1, 1)
> [1] "GCGACTG" "GGGACTG" "GGCACTG" "GGCGCTG" "GGCGATG" "GGCGACG" "GGCGACT"

  makeDel(seq1, 2)
>  [1] "GCGACTG" "GGGACTG" "GGCACTG" "GGCGCTG" "GGCGATG" "GGCGACG" "GGCGACT"
>  [8] "CGACTG"  "GGACTG"  "GCACTG"  "GCGCTG"  "GCGATG"  "GCGACG"  "GCGACT" 
> [15] "GGGCTG"  "GGGATG"  "GGGACG"  "GGGACT"  "GGCCTG"  "GGCATG"  "GGCACG" 
> [22] "GGCACT"  "GGCGTG"  "GGCGCG"  "GGCGCT"  "GGCGAG"  "GGCGAT"  "GGCGAC" 

  makeIns(seq1, 1) # default form
>  [1] "GAGCGACTG" "GCGCGACTG" "GGGCGACTG" "GTGCGACTG" "GGACGACTG" "GGCCGACTG"
>  [7] "GGTCGACTG" "GGCAGACTG" "GGCGGACTG" "GGCTGACTG" "GGCGAACTG" "GGCGCACTG"
> [13] "GGCGTACTG" "GGCGACCTG" "GGCGAGCTG" "GGCGATCTG" "GGCGACATG" "GGCGACGTG"
> [19] "GGCGACTTG" "GGCGACTAG" "GGCGACTCG" "GGCGACTGG"

  makeIns(seq1, 1, symbol = "Y") # inserting only "C" or "T"
>  [1] "GCGCGACTG" "GTGCGACTG" "GGCCGACTG" "GGTCGACTG" "GGCTGACTG" "GGCGCACTG"
>  [7] "GGCGTACTG" "GGCGACCTG" "GGCGATCTG" "GGCGACTTG" "GGCGACTCG"

  makeSub("AAA", 1)
> [1] "CAA" "GAA" "TAA" "ACA" "AGA" "ATA" "AAC" "AAG" "AAT"

  makeSub("AAA", 2)
>  [1] "CAA" "GAA" "TAA" "ACA" "AGA" "ATA" "AAC" "AAG" "AAT" "CCA" "GCA" "TCA"
> [13] "CGA" "GGA" "TGA" "CTA" "GTA" "TTA" "CAC" "GAC" "TAC" "CAG" "GAG" "TAG"
> [25] "CAT" "GAT" "TAT" "ACC" "AGC" "ATC" "ACG" "AGG" "ATG" "ACT" "AGT" "ATT"

These functions can be used together with vmatchPattern() to create variants and extract matches. One suggested approach would be to first find those sequences with mismatches using max.mismatch = 1. Next, find sequences with deletions and with insertions using vmatchPattern() with fixed = FALSE and the default value of 0 for max.mismatch.

Alternatively, the explicit patterns generated by the helper functions can be passed to grep processes running in parallel! What follows shows the use of vmatchPattern but there may be reasons to perform the analysis with different tools. See the comments on this topic.

# first, allow mismatches to the original pattern
# the result is a "ByPos_MIndex" object of length 1000
  m1 <- vmatchPattern(seq1, ref, max.mismatch = 1) # as before...
  n1 <- elementNROWS(m1) # counts the number of elements (matches)
  which(n1 > 0) # which of the 1000 ref sequence had a match with 0 or 1 mismatches?
>  [1]  14  71  73  76  79  88  90 108 126 129 138 141 150 160 163 179 180 195 200
> [20] 205 212 225 227 239 241 246 247 255 276 277 280 299 310 335 338 345 347 357
> [39] 359 369 378 383 387 390 391 404 409 410 414 418 469 472 479 488 499 509 523
> [58] 531 533 567 571 574 580 588 590 591 594 601 634 636 646 654 667 679 685 694
> [77] 696 713 717 732 734 737 749 750 761 762 783 815 853 854 857 903 929 943 959
> [96] 969 981 986 998

# Second search each of the patterns with lapply
# generates seven lists of objects, each of length 10000
  pat2 <- makeDel(seq1, 1)
  m2 <- lapply(pat2, function(pat) vmatchPattern(pat, ref))

# generates 22 lists of objects, each of length 10000
  pat3 <- makeIns(seq1, 1)
  m3 <- lapply(pat3, function(pat) vmatchPattern(pat, ref))

The second and third results in m2 and m3 are lists of "ByPos_MIndex" objects. The example below extracts the number of matches from m2 and shows these matches in an abbreviated form with str(). Each value in a list identifies the reference sequence that had at least one match with respective pattern.

  n2 <- lapply(m2, elementNROWS)
  str(sapply(n2, function(n) which(n > 0)))
> List of 7
>  $ : int [1:14] 14 138 179 335 369 391 567 679 713 734 ...
>  $ : int [1:18] 138 200 240 298 310 343 510 594 598 599 ...
>  $ : int [1:15] 21 26 45 60 260 497 541 600 607 642 ...
>  $ : int [1:17] 27 54 120 121 123 132 210 242 244 257 ...
>  $ : int [1:18] 15 33 110 126 154 419 528 539 546 606 ...
>  $ : int [1:12] 24 77 79 139 525 588 601 679 770 850 ...
>  $ : int [1:15] 179 345 378 414 469 571 574 580 591 713 ...

This final example examines the third list of 22 "ByPos_MIndex" objects (m3) by the same mechanism. It shows that some of the 22 variants fail to match, some match once and five match twice.

    n3 <- lapply(m3, elementNROWS) # extract all counts
    names(n3) <- sprintf("pat_%02d", seq_along(n3)) # for convenience
    str(lapply(n3, function(n) which(n > 0)))
> List of 22
>  $ pat_01: int 679
>  $ pat_02: int 391
>  $ pat_03: int(0) 
>  $ pat_04: int 737
>  $ pat_05: int(0) 
>  $ pat_06: int(0) 
>  $ pat_07: int 108
>  $ pat_08: int 276
>  $ pat_09: int 439
>  $ pat_10: int [1:2] 764 773
>  $ pat_11: int(0) 
>  $ pat_12: int [1:2] 22 820
>  $ pat_13: int 795
>  $ pat_14: int [1:2] 914 981
>  $ pat_15: int(0) 
>  $ pat_16: int 112
>  $ pat_17: int 884
>  $ pat_18: int(0) 
>  $ pat_19: int [1:2] 345 378
>  $ pat_20: int [1:2] 571 854
>  $ pat_21: int 574
>  $ pat_22: int(0) 

Needless to say, a lot of data wrangling remains in order to extract sequence information. This can be coded with the help pages for matchPattern and with some understanding of the pattern matching logic described in help("lowlevel-matching", package = "Biostrings").

Although the routines in Biostrings use very fast and very memory-efficient algorithms for handling large sequences. Joe seems to find raw searching faster under other circumstances. There's always more to learn!

David O
  • 803
  • 4
  • 10
  • So GGCGACTG is not the only sequence I am looking for. I am actually looking for four other sequences in addition to GGCGACTG. The way I currently do it is to find the length of the number of characters of a `matchPattern` result (`nchar` will give me a vector of widths for every match, and the length is the number of matches in the sequence). If the length is greater than zero, that means there is a match between the pattern and the Illumina sequence. There are over 50k sequences in the file (in some files there are as many as 7 million sequences). – Joe Kowzun Jul 28 '19 at 19:35
  • 1
    I currently `parSapply` to go through the sequences faster. Since I can't nest `parSapply`s to go parallel over both the Illumina sequences and the patterns I am searching for, I was suggested to vectorize my patterns, and feed the vector to `grepl`. `matchPattern` will generate the mutations for me with the mismatch arguments. If I do it through `grepl`, I need to tell it the mutations. As I've shown in the code I created, I can make all the 1 mutation patterns for an 8 nucleotide sequence, and when I fed that to `grepl`, it took less than a second. – Joe Kowzun Jul 28 '19 at 19:37
  • The `matchPattern` equivalent takes 30 seconds on average. I got the same result when doing it, so I am confident in the instance I am using it, `grepl` is faster. – Joe Kowzun Jul 28 '19 at 19:37
  • I am pretty sure I can, but I just want to make sure, since I am a coding n00b, if I just want a huge vector of all possible mutations, allowing for 2 mutations, I would do `mutations_in_seq1 <- c(makeSub(seq1,2),makeIns(seq1,2),makeDel(seq1,2))`, after making these functions available to my coding environment? – Joe Kowzun Jul 29 '19 at 13:59
  • Yes, exactly! And I see why you would want to hand them to a parallel processing routine. It wouldn't hurt to pass the final vector through `unique` although I think it's not necessary. – David O Jul 29 '19 at 14:33
  • And if it works for you, I suggest accepting the answer. That will help others find this in the future and call attention to the solution of @Brian, which may provides insight into the heart of this problem. – David O Jul 29 '19 at 14:47
  • The only difference between what is generated in my code and `mutations_in_seq1 <- c(makeSub(seq1,1),makeIns(seq1,1),makeDel(seq1,1))` is that my vector contains the original sequence. That's just a matter of adding seq1 to the `c` function. So you are right in that `unique` is not needed, as my original code removes all duplicates with `mutsinseq1 <- mutsinseq1[-which(duplicated(mutsinseq1))]`. – Joe Kowzun Jul 29 '19 at 14:55
  • I just noticed one naive thing about splitting it up into 3 different functions. If you allow 2 mutations, you will lose all mutations that would happen by simultaneous insertion and substitutions (or deletions and substitutions, and some insertions and deletions). Like with ACGT, ACCTT wouldn't be listed as a possible mutation, but if you delete 1 C and substitute a G for the first T (2 mutations), you go back to ACGT. – Joe Kowzun Jul 29 '19 at 16:13
  • Hah, hah! Good point. That's why I wish the `with.indels = TRUE` and `max.mismatch = 2` options could be used with `vmatchPattern()`. I suspect somebody or some package provides for the functions I've described here. – David O Jul 29 '19 at 16:52
  • For now, my boss thinks that in 8 nucleotides only 1 mismatch is likely to show up (I have scrolled through some of my narrowed down results when not allowing for mutations, and noticed that they are 1 off from some of my sequences, so I tend to agree). But by allowing for 1 mutation in some 8 nucleotide sequences, I might lose specificity (for GGCGACTG, CGACTG only occurs once in the virus genome, so 1 mutation of GGCGACTG is fine still, but the same might not be able to be said for all other sequences). – Joe Kowzun Jul 29 '19 at 17:12
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/197222/discussion-between-joe-kowzun-and-david-o). – Joe Kowzun Jul 30 '19 at 14:57