My string is a DNA sequence:

x <- "TATAATGCAACGAGGGGCATAATTATATATGCCCAAAATCTGATATAATGACCGGGTAG"

I want to extract every substring that runs from ATG to the next TAA, TGA or TAG. So far I can extract from one fixed point to another using the stringi package with a regex.

My code is

library(stringi)
stri_extract_all(x, regex = "ATG.*?TAA")

How can I extend this so that a match can end at any of TAA, TGA or TAG?

G5W
  • `from ATG to TAA` ... to _which_ `TAA`? There could be many `TAA` bases after the `ATG`. – Tim Biegeleisen Jun 14 '18 at 12:30
  • The TAA that comes after ATG. Yes, there could be many TAAs in a sequence, and I want to extract them all. And not just TAA, but TAG and TGA as well. – charu sonwal Jun 14 '18 at 12:32
  • Something like this will probably work: `regmatches(x, gregexpr("(?<=ATG).*?(?=TAA)", x, perl = TRUE))`. What do you want to do with the TGA and TAG? – Wimpel Jun 14 '18 at 12:35

2 Answers

If you meant str_extract_all from the stringr package, note that it has no argument called regex; it takes pattern (stri_extract_all from stringi does accept regex). Either way, you can use the alternation operator | to allow any of the three stop codons.

library(stringr)
str_extract_all(x, pattern="ATG.*?(TAA|TGA|TAG)")
[[1]]
[1] "ATGCAACGAGGGGCATAA" "ATGCCCAAAATCTGA"    "ATGACCGGGTAG"
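
For comparison, here is a minimal sketch of the same lazy pattern run through stringi (which the question mentions) and through base R. It assumes the same x as in the question; the names pattern, hits_base and hits_stringi are made up for illustration, and the stringi call is guarded so the block still runs if that package is not installed.

```r
x <- "TATAATGCAACGAGGGGCATAATTATATATGCCCAAAATCTGATATAATGACCGGGTAG"

# Lazy match from ATG up to the first TAA, TGA or TAG that follows it
pattern <- "ATG.*?(TAA|TGA|TAG)"

# Base R: gregexpr() locates the matches, regmatches() extracts them
hits_base <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]

# stringi: the pattern goes in the `regex` argument (skipped if stringi is absent)
if (requireNamespace("stringi", quietly = TRUE)) {
  hits_stringi <- stringi::stri_extract_all(x, regex = pattern)[[1]]
}

print(hits_base)
```

Both calls return the matches for the single input string as the first element of a list, hence the [[1]].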
G5W

Here is a possibility using Biostrings:

library(Biostrings)
library(tidyverse)  # provides %>% and filter()

x <- "TATAATGCAACGAGGGGCATAATTATATATGCCCAAAATCTGATATAATGACCGGGTAG"

# Get all combinations of substrings starting with "ATG" and ending with "TAA"
df <- expand.grid(start(matchPattern("ATG", x)), end(matchPattern("TAA", x))) %>%
    filter(Var1 < Var2)
ir <- IRanges(df[, 1], df[, 2])

extractAt(BString(x), ir)
#A BStringSet instance of length 3
#  width seq
#[1]    18 ATGCAACGAGGGGCATAA
#[2]    44 ATGCAACGAGGGGCATAATTATATATGCCCAAAATCTGATATAA
#[3]    20 ATGCCCAAAATCTGATATAA

Since you're working with DNA sequence data, I recommend familiarising yourself with Biostrings from Bioconductor. Many other Bioconductor packages will also make your life a lot easier down the track when you work with DNA/RNA sequence data.


Update

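To account for multiple stop codons, wrap end(matchPattern(...)) in an sapply loop over the three stop codons and unlist the result. Note that this returns every start/stop combination, not just the nearest stop for each ATG.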

df <- expand.grid(
    start(matchPattern("ATG", x)),
    unlist(sapply(c("TAA", "TGA", "TAG"), function(ss) end(matchPattern(ss, x))))) %>%
    filter(Var1 < Var2)
ir <- IRanges(df[, 1], df[, 2])

extractAt(BString(x), ir)
# [1]    18 ATGCAACGAGGGGCATAA
# [2]    44 ATGCAACGAGGGGCATAATTATATATGCCCAAAATCTGATATAA
# [3]    20 ATGCCCAAAATCTGATATAA
# [4]    39 ATGCAACGAGGGGCATAATTATATATGCCCAAAATCTGA
# [5]    15 ATGCCCAAAATCTGA
# ...   ... ...
# [7]    23 ATGCCCAAAATCTGATATAATGA
# [8]     4 ATGA
# [9]    55 ATGCAACGAGGGGCATAATTATATATGCCCAAAATCTGATATAATGACCGGGTAG
#[10]    31 ATGCCCAAAATCTGATATAATGACCGGGTAG
#[11]    12 ATGACCGGGTAG
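
If only the nearest stop codon for each ATG is wanted, rather than every combination, the position vectors can be filtered before extraction. The following is a minimal base-R sketch under that assumption, not a Biostrings recipe: find_all, stop_starts, nearest_end and orfs are hypothetical names, and stop codons are matched anywhere downstream of the ATG, without regard to reading frame, just as in the code above.

```r
x <- "TATAATGCAACGAGGGGCATAATTATATATGCCCAAAATCTGATATAATGACCGGGTAG"

# Hypothetical helper: start positions of every literal occurrence of `pattern`
find_all <- function(pattern, s) {
  pos <- as.integer(gregexpr(pattern, s, fixed = TRUE)[[1]])
  pos[pos > 0]  # gregexpr() returns -1 when nothing matches
}

starts <- find_all("ATG", x)
stop_starts <- sort(unlist(lapply(c("TAA", "TGA", "TAG"), find_all, s = x)))

# For each ATG, take the first stop codon that begins after the ATG has ended
nearest_end <- vapply(starts, function(s) {
  downstream <- stop_starts[stop_starts >= s + 3L]
  if (length(downstream) == 0L) NA_integer_ else downstream[1] + 2L
}, integer(1))

# Drop any ATG with no stop codon downstream, then cut out the substrings
keep <- !is.na(nearest_end)
orfs <- substring(x, starts[keep], nearest_end[keep])
print(orfs)
```

Requiring the stop codon to begin at or after s + 3 excludes hits such as ATGA, where the TGA overlaps the ATG itself.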
Maurits Evers
  • But I don't want it to stop at TAA only. My question is how to make it start from ATG, stop as soon as it finds TAA, TGA or TAG, extract that substring, and then repeat to find more substrings in the sequence. – charu sonwal Jun 14 '18 at 12:52