see if a string is embedded in a larger string

Question

I have data in R that looks like this:

> hits
  Views on a 51-letter DNAString subject
subject: TCAGAAACAAAACCCAAAATCAGTAAGGAGGAGAAAGAAACCTAGGGAGAA
views:
    start end width
[1]     1  10    10 [TCAGAAACAA]
[2]    14  23    10 [CCAAAATCAG]
[3]    19  28    10 [ATCAGTAAGG]
[4]    20  29    10 [TCAGTAAGGA]
[5]    21  30    10 [CAGTAAGGAG]

So I have a string of length 51 called

subject = TCAGAAACAAAACCCAAAATCAGTAAGGAGGAGAAAGAAACCTAGGGAGAA.

5 substrings are extracted from this subject. You can see them above. I'd like to see if the 5 substrings are in my area of interest. This area of interest is from position 14 - 27.

subject = TCAGAAACAAAAC |-> CCAAAATCAGTAAG <-| GAGGAGAAAGAAACCTAGGGAGAA.

In other words, I have 5 substrings from the subject string. Out of these 5 strings, I am only looking for strings that lie between position 14 - 27 of the subject string. This is my area of interest.

The first [1] substring [TCAGAAACAA] is not that important since it is embedded right at the start (given by the coordinates 1 - 10) and is outside my area of interest.

The second [2] string, given by the coordinates 14 - 23, is entirely embedded in my area of interest (which again is 14 - 27).

The third [3] string is given by the coordinates 19 - 28. This is important to me as the majority of the string is embedded in my area of interest.

The fourth [4] string is given by the coordinates 20 - 29. Again, this is important to me since the majority of the string is embedded in my area of interest, all except the last two characters.

The story is the same for the fifth substring.

Basically if 60% of the string is embedded in my area of interest I'd like to count it.

Can someone give me an algorithm in pseudocode that can do this? I have been thinking about this for a while and drawing diagrams, but I can't seem to implement it. I am doing this in R, so I will convert the pseudocode to R. Also, the 60% figure is arbitrary. I'll have to confirm it with my supervisor, but I am sure the exact value is irrelevant.

jgritty · Answer 1 · 2015-01-14T22:21:05.647

def substring_index(longstring, substring):
    """Return the 1-based index of substring in longstring, or 0 if absent."""
    # str.find is 0-based and returns -1 when the substring is missing.
    return longstring.find(substring) + 1

def is_interesting(index, length, interesting_start, interesting_end, percentage):
    """Return True if the substring is interesting."""
    if index < 1 or length < 1:
        return False  # substring absent or empty
    interesting = 0
    uninteresting = 0
    # Check if the character at each position from index to index + length - 1
    # is in the interesting range (both ends inclusive).
    for x in range(index, index + length):
        if interesting_start <= x <= interesting_end:
            interesting += 1
        else:
            uninteresting += 1
    # The substring counts if interesting / (interesting + uninteresting)
    # is at least percentage.
    return interesting / (interesting + uninteresting) >= percentage

Use the substring_index function to see if and where the substring lies in the longstring.

Use the is_interesting function to return a boolean based on whether the substring is interesting.

So, for the first substring, you could call it like this:

longstring = "TCAGAAACAAAACCCAAAATCAGTAAGGAGGAGAAAGAAACCTAGGGAGAA"
substring = "TCAGAAACAA"
is_interesting(substring_index(longstring, substring), len(substring), 14, 27, 0.6)
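Since the views already come with start and end coordinates, the per-character loop can also be replaced by interval arithmetic. The following is a minimal sketch under that assumption, not part of the original answer; `overlap_fraction` and the `views` list are hypothetical names, and all coordinates are treated as 1-based and inclusive, as in the question.

```python
def overlap_fraction(start, end, roi_start, roi_end):
    """Fraction of the 1-based inclusive span [start, end] inside [roi_start, roi_end]."""
    width = end - start + 1
    # Length of the intersection of the two intervals; negative means no overlap.
    overlap = min(end, roi_end) - max(start, roi_start) + 1
    return max(overlap, 0) / width

# Start and end coordinates of the five views from the question (assumed input format).
views = [(1, 10), (14, 23), (19, 28), (20, 29), (21, 30)]

# Area of interest is positions 14 to 27; a view counts at 60% or more.
fractions = [overlap_fraction(s, e, 14, 27) for s, e in views]
counted = [f >= 0.6 for f in fractions]
print(counted)
```

With the question's data, the first view has no overlap with the area of interest, while the other four have 10, 9, 8 and 7 of their 10 characters inside it.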

score 0 · Answer 2 · answered Jan 14 '15 at 20:17

If I understood correctly, you need to:

Define an 'area of interest' given by a start position and an end position.
Find a string or an accepted portion of a string in the area of interest of the larger string.

So this is what I would do in JavaScript:

var fractionIsInString = function (areaOfInterest, stringToBeFound, acceptedFraction) {
    var fractionLength = Math.floor(stringToBeFound.length*acceptedFraction),
        startPosition = 0,
        endPosition = fractionLength,          
        fraction,
        keepSearching = true;

    do {
        fraction = stringToBeFound.substring(startPosition, endPosition);
        if (areaOfInterest.indexOf(fraction) > -1) {
            return true;
        }
        startPosition++;
        endPosition++;
        // substring's end index is exclusive, so use <= to test the last window too.
        keepSearching = endPosition <= stringToBeFound.length;
    } while (keepSearching);

    return false;
};

To call it you simply say

fractionIsInString('CCAAAATCAGTAAG', 'TCAGAAACAA', 0.6);

The first parameter is your area of interest, which can be obtained like this

subject.substring(13, 27); // 0-based start, exclusive end: positions 14 to 27

The second parameter is the first of the strings you get from your subject, the one at positions 1 to 10. The third parameter is the fraction of the second parameter that must be found, 60% in this case.

The function slides a window of that fraction's length along the string to be found. It looks for each window in the area of interest and returns true at the first match, or false if it reaches the end of the string without finding one.
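For comparison with the Python in the other answer, the same sliding-window idea can be sketched as below. This is an illustrative port, not the answerer's code: the name `fraction_is_in_string` is made up here, and returning False for an empty window is an assumption.

```python
import math

def fraction_is_in_string(area_of_interest, string_to_find, accepted_fraction):
    """Return True if any window of the accepted length occurs in the area."""
    window = math.floor(len(string_to_find) * accepted_fraction)
    if window < 1:
        return False  # assumption: an empty window never counts as a match
    # Slide the window one character at a time, including the final window.
    for start in range(len(string_to_find) - window + 1):
        if string_to_find[start:start + window] in area_of_interest:
            return True
    return False

subject = "TCAGAAACAAAACCCAAAATCAGTAAGGAGGAGAAAGAAACCTAGGGAGAA"
area = subject[13:27]  # positions 14 to 27, 1-based inclusive
first = fraction_is_in_string(area, "TCAGAAACAA", 0.6)
second = fraction_is_in_string(area, "CCAAAATCAG", 0.6)
print(first, second)
```

Note that this check compares sequence content rather than coordinates, so a view whose sequence happens to recur inside the area of interest would match wherever it starts in the subject.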
