I'd like to extract particular substrings from long DNA sequences, in the order in which they occur.
For example, given:
1/ATXGAAATTXXGGAAGGGGTGG
2/AATXGAAGGAAGGAAGGGGATATTX
3/AAAAAATTXXGGAAGGGGXTTTA
4/AAAATTXXATAXXGGAAGGGGXTXG
5/ATTATTGTTXAXTATTT
the desired output is:
1/TXG - TTXX
2/TXG -
3/ - TTXX
4/TTXX - TXG
5/ -
I tried the following regex pattern:
(TXG|TTXX)
It works, and the results are put in a list, but I don't know how to recover the order in which each match appeared in the original sequence. That is, TTXX and TXG appear first and second respectively in sequence 4, but second and first in sequence 1. The 2nd and 3rd results are harder still, because the match-xx function call doesn't return the index at which each substring was found in the sequence in question. Thank you for your insights.
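
To make the goal concrete, here is a minimal sketch of the kind of result I am after. It assumes Python and its standard `re` module, which may not match the language in use; `ordered_matches` is a hypothetical helper name. `re.finditer` yields match objects in left-to-right order, and each one exposes the offset of the match through `start()`, so both the order and the index of every substring can be read off directly.

```python
import re

# Same alternation as in the question; compiled once for reuse.
PATTERN = re.compile(r"TXG|TTXX")

def ordered_matches(sequence):
    """Return (substring, start_index) pairs in order of appearance."""
    # finditer scans left to right and reports non-overlapping matches,
    # so the list is already sorted by position in the sequence.
    return [(m.group(), m.start()) for m in PATTERN.finditer(sequence)]

sequences = {
    1: "ATXGAAATTXXGGAAGGGGTGG",
    2: "AATXGAAGGAAGGAAGGGGATATTX",
    3: "AAAAAATTXXGGAAGGGGXTTTA",
    4: "AAAATTXXATAXXGGAAGGGGXTXG",
    5: "ATTATTGTTXAXTATTT",
}

for label, seq in sequences.items():
    hits = ordered_matches(seq)
    # Print each match with its 0-based offset, e.g. "1/TXG@1 - TTXX@7".
    print(f"{label}/" + " - ".join(f"{sub}@{pos}" for sub, pos in hits))
```

With the sample data, sequence 1 gives TXG at offset 1 followed by TTXX at offset 7, and sequence 4 gives TTXX at offset 4 followed by TXG at offset 22. Sequences with a single match, or none, simply produce a shorter or empty list. How the single matches should be placed around the " - " separator is left open here, since that rule is not spelled out above.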