We have to find the number of substrings of a string that contain some anagram of another string as a subsequence.

Substrings are considered different only if their start or end positions differ.

String="aba"
anotherString="a"

Occurrences of "a" in "aba" are as follows:

a     at index 0..0
ab    at index 0..1
aba   at index 0..2
ba    at index 1..2
a     at index 2..2

i.e. a total of 5 substrings, so the output is 5
(the start and end points here are inclusive).
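
For reference, the definition above can be checked directly with a brute-force sketch. It relies on the observation that a substring contains an anagram of the other string as a subsequence exactly when it has at least as many of each character. The class and method names (`BruteForceCounter`, `countGood`, `covers`) are made up for illustration, and the fixed table size assumes characters in the range 0..255.

```java
final class BruteForceCounter {
    // Assumption: all characters fit in 0..255 (e.g. ASCII / Latin-1 input).
    private static final int ALPHABET = 256;

    // Counts substrings of s (distinguished by start/end index) that contain
    // at least as many of every character as t does.
    static long countGood(String s, String t) {
        int[] need = new int[ALPHABET];
        for (int i = 0; i < t.length(); i++) {
            need[t.charAt(i)]++;
        }
        long result = 0;
        for (int left = 0; left < s.length(); left++) {
            int[] have = new int[ALPHABET]; // counts for s[left..right]
            for (int right = left; right < s.length(); right++) {
                have[s.charAt(right)]++;
                if (covers(have, need)) {
                    result++;
                }
            }
        }
        return result;
    }

    // True if have[c] >= need[c] for every character c.
    static boolean covers(int[] have, int[] need) {
        for (int c = 0; c < need.length; c++) {
            if (have[c] < need[c]) {
                return false;
            }
        }
        return true;
    }
}
```

This runs in O(n^2 * ALPHABET) time, so it is only useful as a reference to compare a faster solution against.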

I approached this question as a combination of two known problems: "number of occurrences of a subsequence in a string" and "find the smallest window in a string containing all characters of another string".

But even after many changes to the combined code I am unable to come up with a solution. I know where my code goes wrong; what I want to know is how to solve this efficiently, without a brute-force solution.

Code:

public static void findLengthAndSequence(String str1,String str2){

    int begin=0,biginWith=0,endWith=0,count=0,minLen=Integer.MAX_VALUE,len=0;
    int l=0;

    int [] hasFound=new int[256];
    int [] toFound=new int[256];

    for(int i=0;i<str2.length();++i){           
        toFound[(int)str2.charAt(i)]++;
    }

    for(int end=0;end<str1.length();++end){
        if(toFound[(int)str1.charAt(end)]==0)
            continue;
        hasFound[(int)str1.charAt(end)]++;
        if(hasFound[(int)str1.charAt(end)]<=toFound[(int)str1.charAt(end)]){
            count++;
        }

        if(count==str2.length()){
            l++;        //add this to find number of such anagram in string
            System.out.println("l= "+l+" "+begin+" "+end); 
            while(toFound[(int)str1.charAt(begin)]==0 || hasFound[(int)str1.charAt(begin)]>toFound[(int)str1.charAt(begin)]  )
            {
                if(hasFound[(int)str1.charAt(begin)]>toFound[(int)str1.charAt(begin)]){
                    hasFound[(int)str1.charAt(begin)]-=1;                       
                }
                begin++;
            }//while
            len=end-begin+1;
            if(minLen>len){
                minLen=len;
                endWith=end;
                biginWith=begin;
            }
        }//if
    }//end

    for(int i=biginWith;i<=endWith;++i){
        System.out.print(""+str1.charAt(i));
    }
}

This code outputs 3 for the example above. I know that it does not check every substring: once I reach the end of the first string, the remaining substrings are never examined.

E.g. in "aba" my code checks a, ab and aba, but once it reaches the end it does not check
ba and a. We need to count these too, since they have different index values.

Is there any way other than brute force with exponential time complexity to check every possible substring?

Cyclotron3x3
  • Checking all substrings is not exponential. There are exactly `n * (n + 1) / 2` non-empty substrings in a string of length `n`. It is obviously a polynomial. – kraskevich Jan 11 '15 at 21:04
  • So what time complexity do you want to achieve? – kraskevich Jan 11 '15 at 21:08
  • @ILoveCoding That was a mistake, thanks. But there is an additional cost of O(MN) associated with each substring check. I think we can't do better than O(MN), where M and N are the string lengths. Do you have an algorithm for this question? – Cyclotron3x3 Jan 11 '15 at 21:54

1 Answer

Here is a simple solution with O(n + m) time complexity, assuming that the alphabet size is a constant. Here n is the length of the first string (the one we want to count substrings in) and m is the length of the second string (the anagram string). I will call a substring that contains an anagram of the second string as a subsequence "good".

  1. Let's define count(x, y) as the number of occurrences of the character y in a string x. Then an arbitrary string s contains an anagram of a string t as a subsequence if and only if count(s, c) >= count(t, c) for all c (the proof is simple so I will omit it).

  2. Let's define firstRight(L) as the smallest R such that the substring [L, R] is a good one (it is possible that there is no such R). Then firstRight(L) <= firstRight(L + 1) for all valid L (because of statement 1 and the fact that removing the leftmost character never increases any count(x, y)).

  3. Statement 1 implies that any string can be represented as a vector with alphabetSize elements, where the i-th element of this vector is the number of occurrences of the character i. Statement 2 implies that we can use two pointers.

  4. So the pseudocode of this algorithm looks like this:

    def getCharacterVector(string s):
        result = a vector filled with zeros
        for c in s
            result[c]++
        return result
    
    // Checks that all elements of the first vector
    // are greater than or equal to the corresponding
    // elements of the second vector
    def isGreaterOrEqual(first, second)
        for i = 0 ... length(first) - 1
            if first[i] < second[i]
                return false
        return true
    
    def countSubstrings(string s, string t)
        vT = getCharacterVector(t)
        vS = a vector filled with zeros
        right = 0
        // computes firstRight(0)
        while (right < length(s) and not isGreaterOrEqual(vS, vT))
            vS[s[right]]++
            right++
        if not isGreaterOrEqual(vS, vT) // firstRight(0) is undefined
            return 0 // there are no such substrings
        res = length(s) - right + 1
        for left = 1 ... length(s) - 1
            vS[s[left - 1]]--
            // computes firstRight(left)
            while right < length(s) and vS[s[left - 1]] < vT[s[left - 1]]
                vS[s[right]]++
                right++
            if vS[s[left - 1]] < vT[s[left - 1]] // firstRight(left) is undefined
                break // we are done
            res += length(s) - right + 1
        return res
    

    The idea here is to compute the number of good substrings that start at a fixed position and end anywhere, and to use two pointers to adjust the right border efficiently. The time complexity of this implementation is O(N * ALPHABET_SIZE + M) (which is O(N + M) if we treat the alphabet size as a constant), but it is actually possible to do the firstRight(0) computation more efficiently by keeping track of the "bad" positions in the vS and vT vectors and representing these vectors as hash tables, which achieves O(N + M) complexity regardless of the alphabet size.
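
A Java translation of the pseudocode might look like the sketch below. The class name `GoodSubstringCounter` and the helper `covers` are illustrative names, the 256-entry tables assume characters in the range 0..255, and the handling of an empty second string (every non-empty substring counts as good) is an assumption, since the pseudocode does not address that case.

```java
final class GoodSubstringCounter {
    // Assumption: all characters fit in 0..255 (e.g. ASCII / Latin-1 input).
    private static final int ALPHABET = 256;

    // True if have[c] >= need[c] for every character c.
    static boolean covers(int[] have, int[] need) {
        for (int c = 0; c < need.length; c++) {
            if (have[c] < need[c]) {
                return false;
            }
        }
        return true;
    }

    static long countSubstrings(String s, String t) {
        int n = s.length();
        if (t.isEmpty()) {
            // Assumption: with an empty t, every non-empty substring is good.
            return (long) n * (n + 1) / 2;
        }
        int[] need = new int[ALPHABET];
        for (int i = 0; i < t.length(); i++) {
            need[t.charAt(i)]++;
        }
        int[] have = new int[ALPHABET]; // counts for the window s[left..right-1]
        int right = 0;                  // exclusive right border of the window

        // Computes firstRight(0); this is the only place the full scan is needed.
        while (right < n && !covers(have, need)) {
            have[s.charAt(right++)]++;
        }
        if (!covers(have, need)) {
            return 0; // firstRight(0) is undefined, so there are no good substrings
        }
        long res = n - right + 1;

        for (int left = 1; left < n; left++) {
            char removed = s.charAt(left - 1);
            have[removed]--;
            // Only the removed character can have become deficient.
            while (right < n && have[removed] < need[removed]) {
                have[s.charAt(right++)]++;
            }
            if (have[removed] < need[removed]) {
                break; // firstRight(left) is undefined, so we are done
            }
            res += n - right + 1;
        }
        return res;
    }
}
```

For the example in the question, `GoodSubstringCounter.countSubstrings("aba", "a")` returns 5: three substrings start at index 0, one at index 1 and one at index 2.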

kraskevich
  • Looks good - another way to make this linear time is in firstRight(0) computation to keep track of how many matches have been found. When vS[s[right]] becomes equal to vT[s[right]] for the first time you can increment the number of matches by vS[s[right]]. The loop can stop once matches becomes equal to length(t). – Peter de Rivaz Jan 11 '15 at 22:35
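
The match-counting idea from this comment could be sketched as follows. This is a slight variation, not the commenter's exact scheme: it adds one match per useful character instead of adding vS[s[right]] at once, and it keeps the counter up to date while the left border moves as well. Hash maps replace the fixed-size vectors, so no assumption about the alphabet size is needed. The class name `MatchCountingCounter` is hypothetical, and treating every non-empty substring as good when the second string is empty is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

final class MatchCountingCounter {
    static long countSubstrings(String s, String t) {
        int n = s.length();
        int m = t.length();
        if (m == 0) {
            // Assumption: with an empty t, every non-empty substring is good.
            return (long) n * (n + 1) / 2;
        }
        Map<Character, Integer> need = new HashMap<>();
        for (int i = 0; i < m; i++) {
            need.merge(t.charAt(i), 1, Integer::sum);
        }
        Map<Character, Integer> have = new HashMap<>(); // counts for s[left..right-1]
        int matched = 0; // sum over c of min(have[c], need[c]); window is good iff matched == m
        int right = 0;   // exclusive right border of the window
        long res = 0;

        for (int left = 0; left < n; left++) {
            // Extend the window until it is good or the string ends.
            while (right < n && matched < m) {
                char c = s.charAt(right++);
                int count = have.merge(c, 1, Integer::sum);
                if (count <= need.getOrDefault(c, 0)) {
                    matched++; // this occurrence was still required
                }
            }
            if (matched < m) {
                break; // no good substring starts at left or later
            }
            res += n - right + 1; // every end position from right - 1 to n - 1 works

            // Drop s[left] before moving the left border.
            char removed = s.charAt(left);
            int count = have.get(removed);
            if (count <= need.getOrDefault(removed, 0)) {
                matched--; // a required occurrence leaves the window
            }
            have.put(removed, count - 1);
        }
        return res;
    }
}
```

Each character enters and leaves the window at most once and every step does a constant number of hash-map operations, so the expected running time is O(N + M).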