Find the number of substrings of a String that contain some anagram of another String as a subsequence

Question

We have to find the number of substrings of a String that contain some anagram of another String as a subsequence.

Two substrings are considered different only if their start or end positions differ.

String="aba"
anotherString="a"

The occurrences of "a" in "aba" are as follows:

a     at index 0..0
ab    at index 0..1
aba   at index 0..2
ba    at index 1..2
a     at index 2..2

i.e. a total of 5 substrings, so the output is 5
(the start and end points here, are inclusive)
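
For reference, the problem statement can be pinned down with a quadratic baseline that extends every start position one character at a time. This is only a sketch to make the counting rule concrete, not the efficient solution being asked for; the class name `BruteForceCounter` is made up for illustration, and the 256-entry arrays assume all characters fall in the range 0..255.

```java
class BruteForceCounter {
    // Counts substrings of s whose character counts cover those of t.
    // Assumption: every character code is below 256.
    static long countGoodSubstrings(String s, String t) {
        int[] need = new int[256];
        for (int i = 0; i < t.length(); i++) {
            need[t.charAt(i)]++;
        }
        long total = 0;
        for (int start = 0; start < s.length(); start++) {
            int[] have = new int[256];
            int missing = t.length(); // characters of t not yet covered
            for (int end = start; end < s.length(); end++) {
                char c = s.charAt(end);
                if (have[c] < need[c]) {
                    missing--; // this character fills a still-open slot of t
                }
                have[c]++;
                if (missing == 0) {
                    total++; // s[start..end] contains an anagram of t
                }
            }
        }
        return total;
    }
}
```

Calling `BruteForceCounter.countGoodSubstrings("aba", "a")` counts the five substrings listed above.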

I thought of this question as an application of two known problems: "number of occurrences of a subsequence in a string" and "find the smallest window in a string containing all characters of another string".

But even after many changes to the combined code, I am unable to come up with a solution. I know where my code goes wrong; what I want to know is how to solve this efficiently, without a brute-force solution.

Code :

public static void findLengthAndSequence(String str1,String str2){

    int begin=0,biginWith=0,endWith=0,count=0,minLen=Integer.MAX_VALUE,len=0;
    int l=0;

    int [] hasFound=new int[256];
    int [] toFound=new int[256];

    for(int i=0;i<str2.length();++i){           
        toFound[(int)str2.charAt(i)]++;
    }

    for(int end=0;end<str1.length();++end){
        if(toFound[(int)str1.charAt(end)]==0)
            continue;
        hasFound[(int)str1.charAt(end)]++;
        if(hasFound[(int)str1.charAt(end)]<=toFound[(int)str1.charAt(end)]){
            count++;
        }

        if(count==str2.length()){
            l++;        //add this to find number of such anagram in string
            System.out.println("l= "+l+" "+begin+" "+end); 
            while(toFound[(int)str1.charAt(begin)]==0 || hasFound[(int)str1.charAt(begin)]>toFound[(int)str1.charAt(begin)]  )
            {
                if(hasFound[(int)str1.charAt(begin)]>toFound[(int)str1.charAt(begin)]){
                    hasFound[(int)str1.charAt(begin)]-=1;                       
                }
                begin++;
            }//while
        len=end-begin+1;
        if(minLen>len){
            minLen=len;
            endWith=end;
            biginWith=begin;
        }
    }//if   
    }//end

    for(int i=biginWith;i<=endWith;++i){
        System.out.print(""+str1.charAt(i));
    }
}

This code outputs 3 for the example above. I know it does not check every substring: once it reaches the end of the first string, the remaining substrings are never examined.

e.g. in "aba" my code checks a, ab and aba, but once it reaches the end it does not check
ba and a. These need to be counted too, since they have different index values.

Is there any way to do this other than a brute-force check of every possible substring, which has exponential time complexity?

Checking all substrings is not exponential. There are exactly `n * (n + 1) / 2` non-empty substrings in a string of length `n`. It is obviously a polynomial. — kraskevich, Jan 11 '15 at 21:04
@ILoveCoding that was a mistake, thanks. But there is an additional cost of O(MN) associated with each substring check. I think we can't get better than O(MN), where M and N are the string lengths. Do you have any algorithm for this question? — Cyclotron3x3, Jan 11 '15 at 21:54

score 5 · Answer 1 · answered Jan 11 '15 at 22:24

Here is a simple solution with O(n + m) time complexity (assuming the alphabet size is a constant), where n is the length of the first string (the one we want to count substrings in) and m is the length of the second string (the anagram string). I will call a substring that contains an anagram of the second string as a subsequence "good".

1. Let's define count(x, y) as the number of occurrences of the character y in a string x. Then an arbitrary string s contains an anagram of a string t as a subsequence if and only if count(s, c) >= count(t, c) for all c (the proof is simple, so I will omit it).
2. Let's define firstRight(L) as the smallest R such that the [L, R] substring is a good one (it is possible that there is no such R). Then firstRight(L) <= firstRight(L + 1) for all valid L (because of statement 1 and the properties of the count(x, y) function).
3. Statement 1 implies that any string can be represented as a vector with alphabetSize elements, where the i-th element of this vector is the number of occurrences of the character i. Statement 2 implies that we can use two pointers.
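
The count condition from the first statement can be checked directly by comparing character vectors. A minimal Java sketch, assuming characters fit in 0..255 (the names `AnagramCheck`, `characterVector` and `containsAnagramAsSubsequence` are made up for illustration):

```java
class AnagramCheck {
    // Builds the character vector: entry c holds count(x, c).
    // Assumption: every character code is below 256.
    static int[] characterVector(String x) {
        int[] v = new int[256];
        for (int i = 0; i < x.length(); i++) {
            v[x.charAt(i)]++;
        }
        return v;
    }

    // True if and only if count(s, c) >= count(t, c) for every character c.
    static boolean containsAnagramAsSubsequence(String s, String t) {
        int[] vs = characterVector(s);
        int[] vt = characterVector(t);
        for (int c = 0; c < vs.length; c++) {
            if (vs[c] < vt[c]) {
                return false;
            }
        }
        return true;
    }
}
```

For example, "xbya" passes the check against "ab" because it has at least one 'a' and one 'b', while "ab" fails against "aab" because it has only one 'a'.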

So a pseudo code of this algorithm looks like this:

def getCharacterVector(string s):
    result = a vector filled with zeros
    for c in s
        result[c]++
    return result

// Checks that all elements of the first vector
// are greater than or equal to the corresponding
// elements of the second vector
def isGreaterOrEqual(first, second)
    for i = 0 ... length(first) - 1
        if first[i] < second[i]
            return false
    return true

def countSubstrings(string s, string t)
    vT = getCharacterVector(t)
    vS = a vector filled with zeros
    right = 0
    // computes firstRight(0)
    while (right < length(s) and not isGreaterOrEqual(vS, vT))
        vS[s[right]]++
        right++
    if not isGreaterOrEqual(vS, vT) // firstRight(0) is undefined
        return 0 // there are no such substrings
    res = length(s) - right + 1
    for left = 1 ... length(s) - 1
        vS[s[left - 1]]--
        // computes firstRight(left)
        while right < length(s) and vS[s[left - 1]] < vT[s[left - 1]]
            vS[s[right]]++
            right++
        if vS[s[left - 1]] < vT[s[left - 1]] // firstRight(left) is undefined
            break // we are done
        res += length(s) - right + 1
    return res

The idea here is to compute the number of good substrings that start at a fixed position and end anywhere, and to use two pointers to adjust the right border efficiently. The time complexity of this implementation is O(N * ALPHABET_SIZE + M) (which is O(N + M) if we treat the alphabet size as a constant), but it is actually possible to do the firstRight(0) computation more efficiently by keeping track of the "bad" positions in the vS and vT vectors, and to represent these vectors as hash tables, to achieve O(N + M) complexity regardless of the alphabet size.
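
A possible Java version of the two-pointer idea is sketched below. It is not a line-by-line translation of the pseudo code: instead of comparing whole vectors, it keeps a single `missing` counter of characters of t that the current window still lacks, so each pointer move costs O(1). The class name `GoodSubstrings` is made up for illustration, the fixed 256-entry arrays (rather than hash tables) assume character codes below 256, and treating an empty t as matching every non-empty substring is my own assumption.

```java
class GoodSubstrings {
    // Counts substrings of s that contain an anagram of t as a subsequence.
    static long countSubstrings(String s, String t) {
        int n = s.length();
        if (t.isEmpty()) {
            // Assumption: with an empty t, every non-empty substring qualifies.
            return (long) n * (n + 1) / 2;
        }
        int[] need = new int[256]; // character vector of t
        for (int i = 0; i < t.length(); i++) {
            need[t.charAt(i)]++;
        }
        int[] have = new int[256]; // character vector of the window s[left, right)
        int missing = t.length();  // characters of t the window still lacks
        long res = 0;
        int right = 0;             // exclusive right border, never moves backwards
        for (int left = 0; left < n; left++) {
            // Extend the window until it is good, i.e. right - 1 == firstRight(left).
            while (right < n && missing > 0) {
                char in = s.charAt(right++);
                if (have[in] < need[in]) {
                    missing--;
                }
                have[in]++;
            }
            if (missing > 0) {
                break; // firstRight(left) is undefined, and so are all later ones
            }
            // Every end position from right - 1 to n - 1 gives a good substring.
            res += n - right + 1;
            // Drop s[left] before moving on to the next start position.
            char out = s.charAt(left);
            have[out]--;
            if (have[out] < need[out]) {
                missing++;
            }
        }
        return res;
    }
}
```

For the example from the question, `GoodSubstrings.countSubstrings("aba", "a")` returns 5. Both pointers only move forward, so the loop body runs O(N) times in total after the O(M) setup.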

Looks good - another way to make this linear time is in firstRight(0) computation to keep track of how many matches have been found. When vS[s[right]] becomes equal to vT[s[right]] for the first time you can increment the number of matches by vS[s[right]]. The loop can stop once matches becomes equal to length(t). — Peter de Rivaz, Jan 11 '15 at 22:35
