I need to find all occurrences (overlapping, partially overlapping, or not) of DNA subsequences in a given sample that start with "AAA" or "GAA", end with "AGT", and have at least 2 other triplets (1 triplet = a combination of 3 letters) between the start and the end.
The code below returns only one sequence: the one containing the maximum number of triplets. I want every matching sequence, starting from a minimum of 2 triplets. There is no upper limit, so it has to give all possible combinations with 2 or more triplets in between.
Can anyone help with this part?
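library( stringr )
dna <- "GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT"
# set constants
start <- c("AAA", "GAA")
end <- "AGT"
# build regex: a start triplet, then 2 or more triplets, then the end triplet
regex <- paste0( "(", paste0( start, collapse = "|" ), ")", "([A-Z]{3}){2,}", end )
# the {2,} quantifier is greedy and matches do not overlap, so only the longest match is returned
str_extract_all( dna, regex )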
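For reference, here is a rough sketch of the kind of output I am after, written in base R without a regex. It is only one possible approach, not a tested solution: it pairs every start position with every end position and keeps the pairs whose gap is a whole number of triplets. The function name `find_all_hits` and its arguments are names I made up for illustration, and it assumes that a match may begin at any position in the string (no fixed reading frame), as the regex above does.

```r
# Hypothetical helper (not part of stringr): list every substring that begins
# with one of `starts`, finishes with `end_codon`, and has at least
# `min_triplets` whole triplets in between. Assumes all codons are 3 letters.
find_all_hits <- function(dna, starts, end_codon, min_triplets = 2) {
  n <- nchar(dna)
  if (n < 3) return(character(0))

  # every 3-letter window of the sequence, one per position
  idx <- seq_len(n - 2)
  windows <- substring(dna, idx, idx + 2)

  # positions where a start triplet or the end triplet begins
  start_pos <- idx[windows %in% starts]
  end_pos <- idx[windows == end_codon]

  # all (start, end) position pairs
  pairs <- expand.grid(s = start_pos, e = end_pos)

  # number of letters strictly between the start triplet and the end triplet
  gap <- pairs$e - pairs$s - 3

  # keep pairs with enough room and a gap that is a multiple of 3
  pairs <- pairs[gap >= 3 * min_triplets & gap %% 3 == 0, ]
  if (nrow(pairs) == 0) return(character(0))

  # extract from the first letter of the start to the last letter of the end
  substring(dna, pairs$s, pairs$e + 2)
}

dna <- "GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT"
hits <- find_all_hits(dna, c("AAA", "GAA"), "AGT")
print(hits)
```

For the sample string this should list six sequences, from the 12-letter "GAACCCACTAGT" up to the full 45-letter string. Because it checks every start/end pair, the work grows with the product of the number of starts and ends, which may matter for long sequences.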