How to extract all characters before and after a certain set of characters in R while making sure those characters are first/last in the string?

Question

I have a long string:

my_string = "GTCAGTCGATCTGGGCATTATGCGTCAAAAGGCTGCTAGCTAAAGCTGATCAGCATCAAAAGGCCGCCCCTATGCTACGAGCATCATGCATCTGGGTCTAGCTAGTGGGCATTCTCTCTGCTGCATTCAGTCACAAAAGGTGTCAGTCGTAGTCATCATCTACATCGTTCATGCTGGGCATTACAGTCAGTCACAAAAGGTCAGTCAGTCA"

I want to extract two things from this string:

Everything "before" the first encountered CAAAAG
Everything "after" the last encountered TGGGCATT

Everything before CAAAAG can be found like this:

stringr::word(my_string, 1, sep = "CAAAAG")

But how do I make sure that this is the first CAAAAG in the string, and that I am getting all the characters that come before that very first CAAAAG?

The same goes for TGGGCATT. I can receive everything "after" TGGGCATT in this way:

stringr::word(my_string, -1, sep = "TGGGCATT")

But how do I make sure that I am getting all characters coming "after" the LAST TGGGCATT in my string?

What do you mean by _making sure_? You can get the first index of `CAAAAG`, then take the substring from 0 with length N, where N is that index. To get everything after `TGGGCATT`, get its last index, then take the substring starting at `N + l` with length `L - (N + l)`, where N is the index, `l` is the length of the word, and L is the length of the string. — M.kazem Akhgary, Jan 22 '18 at 14:50
I don't know R, but getting the first and last index of a substring are common methods. They also return `-1` if the string does not contain the word you are looking for. — M.kazem Akhgary, Jan 22 '18 at 14:52

Balter · Accepted Answer · 2018-01-22T17:15:12.957

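Here are two approaches, one for each part: a lazy regex match for everything before the first CAAAAG, and str_locate_all plus substr for everything after the last TGGGCATT.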

my_string = "GTCAGTCGATCTGGGCATTATGCGTCAAAAGGCTGCTAGCTAAAGCTGATCAGCATCAAAAGGCCGCCCCTATGCTACGAGCATCATGCATCTGGGTCTAGCTAGTGGGCATTCTCTCTGCTGCATTCAGTCACAAAAGGTGTCAGTCGTAGTCATCATCTACATCGTTCATGCTGGGCATTACAGTCAGTCACAAAAGGTCAGTCAGTCA"

library(stringr)

str_match_all(my_string, '(.*?)CAAAAG')

#[[1]]
#     [,1]                                                                           
#[1,] "GTCAGTCGATCTGGGCATTATGCGTCAAAAG"                                              
#[2,] "GCTGCTAGCTAAAGCTGATCAGCATCAAAAG"                                              
#[3,] "GCCGCCCCTATGCTACGAGCATCATGCATCTGGGTCTAGCTAGTGGGCATTCTCTCTGCTGCATTCAGTCACAAAAG"
#[4,] "GTGTCAGTCGTAGTCATCATCTACATCGTTCATGCTGGGCATTACAGTCAGTCACAAAAG"                 
#     [,2]                                                                     
#[1,] "GTCAGTCGATCTGGGCATTATGCGT"                                              
#[2,] "GCTGCTAGCTAAAGCTGATCAGCAT"                                              
#[3,] "GCCGCCCCTATGCTACGAGCATCATGCATCTGGGTCTAGCTAGTGGGCATTCTCTCTGCTGCATTCAGTCA"
#[4,] "GTGTCAGTCGTAGTCATCATCTACATCGTTCATGCTGGGCATTACAGTCAGTCA"  

first.match <- str_match_all(my_string, '(.*?)CAAAAG')[[1]][1,2]
first.match
#[1] "GTCAGTCGATCTGGGCATTATGCGT"

str_locate_all(my_string, 'TGGGCATT')
#[[1]]
#     start end
#[1,]    12  19
#[2,]   106 113
#[3,]   175 182
second.match.index <- str_locate_all(my_string, 'TGGGCATT')[[1]]
second.match <- substr(my_string,second.match.index[nrow(second.match.index),ncol(second.match.index)]+1,
                       nchar(my_string))

second.match
#[1] "ACAGTCAGTCACAAAAGGTCAGTCAGTCA"

Edit: Added '+1' because you want the very next index, not the one where the searched string ends.
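As the comments under the question point out, index-based lookups signal a missing pattern with -1, and the code above assumes both motifs are present. Below is a minimal sketch of the same two extractions wrapped in helper functions that handle that case. The names `before_first` and `after_last` are hypothetical, returning NA on no match is an assumed convention, and the toy string is made up for illustration. It uses base R with fixed (literal) matching rather than stringr.

```r
# Hypothetical helpers: the names and the NA-on-no-match behavior are
# illustrative choices, not part of base R or stringr.
before_first <- function(x, pattern) {
  # regexpr() reports the first match only; -1 means no match
  pos <- as.integer(regexpr(pattern, x, fixed = TRUE))
  if (pos == -1L) return(NA_character_)
  substr(x, 1, pos - 1)
}

after_last <- function(x, pattern) {
  # gregexpr() reports every non-overlapping match, scanning left to right
  pos <- as.integer(gregexpr(pattern, x, fixed = TRUE)[[1]])
  if (pos[1] == -1L) return(NA_character_)
  # start right after the end of the last reported match
  substr(x, pos[length(pos)] + nchar(pattern), nchar(x))
}

toy <- "AAXXBBXXCC"
before_first(toy, "XX")  # "AA"
after_last(toy, "XX")    # "CC"
before_first(toy, "ZZ")  # NA
```

One caveat: because gregexpr() skips overlapping matches, a pattern that can overlap with itself may need a different strategy to find its true last occurrence.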

score 0 · Answer 2 · answered Jan 22 '18 at 22:53

First, find where each pattern occurs:

gregexpr('CAAAAG', my_string)

[[1]]
[1]  26  57 134 194
attr(,"match.length")
[1] 6 6 6 6
attr(,"useBytes")
[1] TRUE

gregexpr('TGGGCATT', my_string)
[[1]]
[1]  12 106 175
attr(,"match.length")
[1] 8 8 8
attr(,"useBytes")
[1] TRUE

Now you can double-check that each of these pairs of expressions returns the same characters:

# Before first occurrence of CAAAAG
stringr::word(my_string, 1, sep = "CAAAAG")
substr(my_string, 1, 26 - 1) # 26 = start of the first occurrence

# After last occurrence of TGGGCATT
stringr::word(my_string, -1, sep = "TGGGCATT")
substr(my_string, 175 + 8, nchar(my_string)) # 175 = start of the last occurrence, 8 = length of 'TGGGCATT'
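The positions 26 and 175 above are copied by hand from the gregexpr output. A sketch that derives them from the match vector instead, so the same code works for other strings; the variable names are illustrative, and it assumes both patterns occur at least once.

```r
my_string <- "GTCAGTCGATCTGGGCATTATGCGTCAAAAGGCTGCTAGCTAAAGCTGATCAGCATCAAAAGGCCGCCCCTATGCTACGAGCATCATGCATCTGGGTCTAGCTAGTGGGCATTCTCTCTGCTGCATTCAGTCACAAAAGGTGTCAGTCGTAGTCATCATCTACATCGTTCATGCTGGGCATTACAGTCAGTCACAAAAGGTCAGTCAGTCA"

# All match start positions, in left-to-right order
first_pos <- gregexpr("CAAAAG", my_string, fixed = TRUE)[[1]]
last_pos  <- gregexpr("TGGGCATT", my_string, fixed = TRUE)[[1]]
n <- length(last_pos)

# These agree with the hard-coded values used above
stopifnot(first_pos[1] == 26, last_pos[n] == 175)

# First element = first occurrence, so stop one character before it
before <- substr(my_string, 1, first_pos[1] - 1)

# Last element = last occurrence, so start right after its end
after <- substr(my_string,
                last_pos[n] + attr(last_pos, "match.length")[n],
                nchar(my_string))

before  # "GTCAGTCGATCTGGGCATTATGCGT"
after   # "ACAGTCAGTCACAAAAGGTCAGTCAGTCA"
```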

Additionally, you can obtain the same results with sub and regular expressions in base R:

# Before first occurrence of CAAAAG
sub('CAAAAG.*$', '', my_string)

[1] "GTCAGTCGATCTGGGCATTATGCGT"

# After last occurrence of TGGGCATT
sub('.*TGGGCATT', '', my_string)

[1] "ACAGTCAGTCACAAAAGGTCAGTCAGTCA"
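Why these two patterns pick out the first and the last occurrence: a regex match starts at the leftmost possible position, so `CAAAAG.*$` begins at the first CAAAAG and `.*` then runs to the end of the string. In `.*TGGGCATT`, the greedy `.*` consumes as much as it can, so the literal that follows ends up on the last TGGGCATT. A toy sketch of that behavior, where the string and the `XX` delimiter are made up for illustration:

```r
toy <- "AAXXBBXXCC"

sub("XX.*$", "", toy)               # "AA": removal starts at the first XX
sub(".*XX", "", toy)                # "CC": greedy .* runs up to the last XX
sub(".*?XX", "", toy, perl = TRUE)  # "BBXXCC": lazy .*? stops at the first XX

# If the delimiter is absent, sub() returns the input unchanged,
# so it may be worth checking with grepl() first
sub(".*ZZ", "", toy)                # "AAXXBBXXCC"
grepl("ZZ", toy, fixed = TRUE)      # FALSE
```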
