I need to find all possible occurrences (overlapping, partially overlapping, or not) of DNA sequences in a given sample that start with “AAA” or “GAA”, end with “AGT”, and have at least 2 other triplets (1 triplet = a combination of 3 letters) between the start and the end.

The code below returns only one sequence, the one containing the maximum number of triplets. I want every matching sequence, starting from a minimum of 2 triplets between the start and the end. There is no upper limit, so it has to return all possible combinations with 2 or more triplets.

Can anyone help with this part?

library(stringr)
dna <- "GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT"
# set constants
start <- c("AAA", "GAA")
end <- "AGT"
# build regex: a start triplet, two or more whole triplets, then the end triplet
regex <- paste0("(", paste0(start, collapse = "|"), ")", "([A-Z]{3}){2,}", end)
# extract matches (currently returns only the single longest sequence)
str_extract_all(dna, regex)
  • Does this answer your question? [Find the sequence using R](https://stackoverflow.com/questions/66328891/find-the-sequence-using-r) – Wimpel Mar 01 '21 at 09:23

1 Answer

I would use this regex pattern:

^[AG]AA[ACGT]{6,}AGT$
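
One caveat, offered as a sketch rather than a statement of what was intended: `{6,}` accepts any run of six or more letters, so the middle part need not be a whole number of triplets. Assuming whole triplets are required, as in the question's own regex, a grouped variant might be used instead. The names `loose`, `strict`, and `x` below are illustrative only.

```r
# `loose` is the pattern above; `strict` is an assumed variant that
# only accepts whole triplets between the start and the end.
loose  <- "^[AG]AA[ACGT]{6,}AGT$"
strict <- "^[AG]AA(?:[ACGT]{3}){2,}AGT$"

# Seven letters between start and end: not a whole number of triplets
x <- paste0("AAA", "CCCCCCC", "AGT")

loose_hit  <- grepl(loose, x, perl = TRUE)   # the loose pattern accepts x
strict_hit <- grepl(strict, x, perl = TRUE)  # the strict pattern rejects x
```

`perl = TRUE` is used here so that the non-capturing group `(?:...)` is handled by the PCRE engine.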

Sample script:

sequences = c("GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT",  # a match
              "CGACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGCCC")  # not a match
matches <- sequences[grepl("^[AG]AA[ACGT]{6,}AGT$", sequences)]
matches

[1] "GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT"
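
The anchored pattern above tests whether a whole string qualifies; it does not list the overlapping sub-sequences the question asks for. A minimal base R sketch of that enumeration follows, assuming that any position in the string may begin a start or end triplet and that only whole triplets may sit between them. The function `find_all_sequences` and its parameters are hypothetical names introduced for illustration.

```r
# Hypothetical helper (not from the original post): pair every start
# triplet with every later end triplet, overlapping or not.
find_all_sequences <- function(dna, starts = c("AAA", "GAA"), end = "AGT",
                               min_triplets = 2) {
  n <- nchar(dna)
  if (n < 3) return(character(0))
  pos <- seq_len(n - 2)
  triplets <- substring(dna, pos, pos + 2)  # triplet beginning at each position
  start_pos <- pos[triplets %in% starts]
  end_pos <- pos[triplets == end]
  if (length(start_pos) == 0 || length(end_pos) == 0) return(character(0))

  # All start/end combinations
  pairs <- expand.grid(s = start_pos, e = end_pos)
  # Letters strictly between the start triplet and the end triplet
  gap <- pairs$e - (pairs$s + 3)
  # Keep pairs with enough letters, in whole triplets (assumption)
  keep <- gap >= 3 * min_triplets & gap %% 3 == 0
  pairs <- pairs[keep, ]
  if (nrow(pairs) == 0) return(character(0))

  substring(dna, pairs$s, pairs$e + 2)
}

dna <- "GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT"
result <- find_all_sequences(dna)
```

For the sample string this yields six sequences, from the shortest, "GAACCCACTAGT", up to the full input. Because every start is paired with every end, the number of candidates can grow quadratically with the number of start and end triplets in long sequences.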
Tim Biegeleisen