How to find substring from string in R?

Question

If my string is a DNA sequence,

x<-"TATAATGCAACGAGGGGCATAATTATATATGCCCAAAATCTGATATAATGACCGGGTAG"

I want to extract every substring that starts at ATG and ends at TAA, TGA or TAG. Using the stringi package with a regex, I can already extract from one fixed point to another.

My code is

stri_extract_all(x, regex = "ATG.*?TAA")

How can I extend this so that the match stops at any of the three endings?

`from ATG to TAA` ... to _which_ `TAA`? There could be many `TAA` bases after the `ATG`. — Tim Biegeleisen, Jun 14 '18 at 12:30
The TAA that comes after ATG. Yes, there could be many TAA in a sequence and I want to extract them all, and not just TAA but TAG and TGA as well. — charu sonwal, Jun 14 '18 at 12:32
Something like this will probably work: `regmatches(x, gregexpr("(?<=ATG).*?(?=TAA)", x, perl = TRUE))`. What do you want to do with the TGA and TAG? — Wimpel, Jun 14 '18 at 12:35

score 1 · Answer 1 · answered Jun 14 '18 at 12:39

I believe that you meant str_extract_all from the stringr package. That function does not have an argument called regex; you need pattern. Once you get past that, you can use alternation (|) to allow any of the three sequence endings.

library(stringr)
str_extract_all(x, pattern="ATG.*?(TAA|TGA|TAG)")
# [[1]]
# [1] "ATGCAACGAGGGGCATAA" "ATGCCCAAAATCTGA"    "ATGACCGGGTAG"
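
For readers who would rather avoid an extra package, the same pattern can be sketched in base R with gregexpr and regmatches (the functions used in Wimpel's comment above). This is only an illustration of the alternation idea, not part of the original answer; the variable names pattern and orfs are made up here.

```r
x <- "TATAATGCAACGAGGGGCATAATTATATATGCCCAAAATCTGATATAATGACCGGGTAG"

# Lazy match from "ATG" up to the first TAA, TGA or TAG that follows it
pattern <- "ATG.*?(TAA|TGA|TAG)"

# gregexpr() finds all non-overlapping matches; regmatches() extracts them
orfs <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
orfs
# [1] "ATGCAACGAGGGGCATAA" "ATGCCCAAAATCTGA"    "ATGACCGGGTAG"
```

As with str_extract_all, the matches are non-overlapping: scanning resumes after the end of each match, so an ATG that lies inside an earlier match is not reported separately.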

Maurits Evers · Answer 2 · 2018-06-14T12:58:13.007

Here is a possibility using Biostrings:

library("Biostrings")

x <- "TATAATGCAACGAGGGGCATAATTATATATGCCCAAAATCTGATATAATGACCGGGTAG"

# Get all combinations of substrings starting with "ATG" and ending with "TAA"
library(tidyverse)
df <- expand.grid(start(matchPattern("ATG", x)), end(matchPattern("TAA", x))) %>%
    filter(Var1 < Var2)
ir <- IRanges(df[, 1], df[, 2])

extractAt(BString(x), ir)
#A BStringSet instance of length 3
#  width seq
#[1]    18 ATGCAACGAGGGGCATAA
#[2]    44 ATGCAACGAGGGGCATAATTATATATGCCCAAAATCTGATATAA
#[3]    20 ATGCCCAAAATCTGATATAA

Since you're working with DNA sequence data, I recommend familiarising yourself with Biostrings from Bioconductor. There exist many Bioconductor packages beyond Biostrings that will make your life a lot easier (down the track), when you're working with DNA/RNA sequence data.

Update

To account for multiple stop codons, simply wrap end(matchPattern(...)) within an sapply loop.

df <- expand.grid(
    start(matchPattern("ATG", x)),
    unlist(sapply(c("TAA", "TGA", "TAG"), function(ss) end(matchPattern(ss, x))))) %>%
    filter(Var1 < Var2)
ir <- IRanges(df[, 1], df[, 2])

extractAt(BString(x), ir)
# [1]    18 ATGCAACGAGGGGCATAA
# [2]    44 ATGCAACGAGGGGCATAATTATATATGCCCAAAATCTGATATAA
# [3]    20 ATGCCCAAAATCTGATATAA
# [4]    39 ATGCAACGAGGGGCATAATTATATATGCCCAAAATCTGA
# [5]    15 ATGCCCAAAATCTGA
# ...   ... ...
# [7]    23 ATGCCCAAAATCTGATATAATGA
# [8]     4 ATGA
# [9]    55 ATGCAACGAGGGGCATAATTATATATGCCCAAAATCTGATATAATGACCGGGTAG
#[10]    31 ATGCCCAAAATCTGATATAATGACCGGGTAG
#[11]    12 ATGACCGGGTAG
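
The table above lists every start/stop combination. If only the nearest stop codon after each ATG is wanted, as the question seems to ask, the candidates can be reduced to one end position per start. The following is a minimal base R sketch of that reduction, not the Biostrings approach itself. It assumes a stop codon must begin after the ATG has ended (so the overlapping ATGA is skipped), and the names starts, ends, nearest_end and orfs_nearest are invented for illustration.

```r
x <- "TATAATGCAACGAGGGGCATAATTATATATGCCCAAAATCTGATATAATGACCGGGTAG"

# Start position of every "ATG" (gregexpr returns -1 when nothing matches)
starts <- as.integer(gregexpr("ATG", x, fixed = TRUE)[[1]])
starts <- starts[starts > 0]

# End position of every stop codon: match start + 2
stops <- c("TAA", "TGA", "TAG")
ends <- sort(unlist(lapply(stops, function(s) {
  pos <- as.integer(gregexpr(s, x, fixed = TRUE)[[1]])
  pos[pos > 0] + 2L
})))

# For each start, keep the closest stop codon that begins after the ATG ends
nearest_end <- vapply(starts, function(s) {
  candidates <- ends[ends - 2L > s + 2L]
  if (length(candidates)) min(candidates) else NA_integer_
}, integer(1))

# Drop starts that have no downstream stop codon, then cut out the substrings
keep <- !is.na(nearest_end)
orfs_nearest <- substring(x, starts[keep], nearest_end[keep])
orfs_nearest
# [1] "ATGCAACGAGGGGCATAA" "ATGCCCAAAATCTGA"    "ATGACCGGGTAG"
```

Unlike a single regex pass, this checks every ATG independently, so on other sequences it can return substrings that overlap one another.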

But I don't want it to stop at TAA only. My question is how to make it start at ATG, stop as soon as it finds TAA, TGA or TAG, extract that substring, and then repeat to find more substrings in the sequence. — charu sonwal, Jun 14 '18 at 12:52
