I have data in R that looks like this:
> hits
Views on a 51-letter DNAString subject
subject: TCAGAAACAAAACCCAAAATCAGTAAGGAGGAGAAAGAAACCTAGGGAGAA
views:
start end width
[1] 1 10 10 [TCAGAAACAA]
[2] 14 23 10 [CCAAAATCAG]
[3] 19 28 10 [ATCAGTAAGG]
[4] 20 29 10 [TCAGTAAGGA]
[5] 21 30 10 [CAGTAAGGAG]
So I have a string of length 51 called subject = TCAGAAACAAAACCCAAAATCAGTAAGGAGGAGAAAGAAACCTAGGGAGAA.
Five substrings are extracted from this subject. You can see them above. I'd like to check whether each of the 5 substrings lies in my area of interest, which runs from position 14 to 27.
subject = TCAGAAACAAAAC |-> CCAAAATCAGTAAG <-| GAGGAGAAAGAAACCTAGGGAGAA
In other words, I have 5 substrings from the subject string. Out of these, I am only looking for the ones that lie between positions 14 and 27 of the subject string. This is my area of interest.
The first substring [1], [TCAGAAACAA], is not that important: it sits right at the start (coordinates 1 - 10) and is outside my area of interest.
The second string [2], given by the coordinates 14 - 23, is entirely embedded in my area of interest (which again is 14 - 27).
The third string [3] is given by the coordinates 19 - 28. This is important to me as the majority of the string is embedded in my area of interest.
The fourth string [4] is given by the coordinates 20 - 29. Again, this is important to me since the majority of the string is embedded in my area of interest; only the last two characters fall outside.
The story is the same for the fifth substring.
Basically, if at least 60% of a string is embedded in my area of interest, I'd like to count it.
Can someone give me an algorithm in pseudocode that can do this? I have been thinking about this for a while and drawing diagrams, but I can't seem to implement it. I am doing this in R, so I will convert the pseudocode to R. Also, the number 60% is arbitrary. I'll have to confirm it with my supervisor, but the exact value should not matter for the algorithm.
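To make the rule concrete, here is a rough base-R sketch of what I think the calculation amounts to: the overlap of each substring with the area of interest, divided by the substring's width. The function name overlap_fraction and its arguments are made up for illustration, and the coordinates are typed in by hand from the table above; with the real object they would presumably come from start(hits) and end(hits).

```r
# Hypothetical helper (not from any package): fraction of each substring
# that falls inside the closed interval [roi_start, roi_end].
overlap_fraction <- function(start, end, roi_start, roi_end) {
  # Overlap length of two closed integer intervals;
  # a negative value means the intervals do not overlap at all.
  overlap <- pmin(end, roi_end) - pmax(start, roi_start) + 1
  overlap <- pmax(overlap, 0)
  # Divide by the substring width to get a fraction between 0 and 1.
  overlap / (end - start + 1)
}

# Coordinates copied by hand from the views shown above.
starts <- c(1, 14, 19, 20, 21)
ends   <- c(10, 23, 28, 29, 30)

frac <- overlap_fraction(starts, ends, roi_start = 14, roi_end = 27)
keep <- frac >= 0.6  # 0.6 is the arbitrary threshold mentioned above

frac         # 0.0 1.0 0.9 0.8 0.7
which(keep)  # 2 3 4 5
```

Under these assumptions, substrings 2 to 5 would be counted and substring 1 would be dropped, which matches what I worked out by hand.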