Find length of smallest window that contains all the characters of a string in another string

Question

I was interviewed recently. I didn't do well because I got stuck on the following question:

Suppose a sequence is given: A D C B D A B C D A C D, and the search sequence is: A C D

The task was to find the start and end index of the shortest window in the given string that contains all the characters of the search string, preserving their order.

Output, assuming indexes start from 1:

start index 10 end index 12

Explanation:

1. The start/end indexes are not 1/3, because although that window contains the characters, their order is not maintained.

2. The start/end indexes are not 1/5, because although that window contains the characters in order, its length is not optimal.

3. The start/end indexes are not 6/9, because although that window contains the characters in order, its length is not optimal.

Please see "How to find smallest substring which contains all characters from a given string?".

That question is different, though, since the order is not maintained there. I'm still struggling to keep track of the indexes. Any help would be appreciated. Thanks.
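For reference, the problem statement above can be pinned down with a brute-force check. This is only a sketch of the specification, not an efficient solution; the function names `is_subsequence` and `shortest_window_bruteforce` are made up for illustration, and it returns 1-based indexes to match the question.

```python
def is_subsequence(needle, window):
    """True if needle's characters appear in window in the same order."""
    it = iter(window)
    # "ch in it" advances the iterator, so later characters must come later.
    return all(ch in it for ch in needle)


def shortest_window_bruteforce(hay, needle):
    """Return 1-based (start, end) of the shortest ordered window, or None."""
    best = None
    for start in range(len(hay)):
        for end in range(start, len(hay)):
            if is_subsequence(needle, hay[start:end + 1]):
                if best is None or end - start < best[1] - best[0]:
                    best = (start + 1, end + 1)
                break  # extending this window further only makes it longer
    return best


print(shortest_window_bruteforce("ADCBDABCDACD", "ACD"))  # (10, 12)
```

It tries every start position and stops at the first end position that works, so it is far slower than the answers below, but it is handy for checking them on small inputs.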

Is the goal to find the **shortest** ordered sequence, or to find this specific sequence? Let's say in your example the last 3 elements were gone, should the answer be 6/9 or "doesn't exist"? — Ron Teller, Oct 06 '13 at 08:02
What about "ADCBD"? That's also a subsequence containing all the characters in given order (and then some). That would invalidate amit's answer... — Aki Suihkonen, Oct 06 '13 at 08:15
N^2 algorithm is pretty obvious, but i suppose you want something better? — siledh, Oct 06 '13 at 09:41
perhaps using the standard problem and solution of LCS would help, taking the source string as the first one. It has a DP solution — fkl, Oct 07 '13 at 07:04
@AnkushDubey have you checked pattern matching algos like `knuth morris pratt` http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm — exexzian, Oct 07 '13 at 17:46
also check this interactive java applet http://www.enseignement.polytechnique.fr/informatique/profs/Jean-Jacques.Levy/00/pc4/strmatch/e.html — exexzian, Oct 07 '13 at 17:50
@AnkushDubey I think there is an O(N) algorithm if the smaller string's length is 3. I was asked the same question in an interview and the restriction was that the smaller string's length is 3. If you can confirm, I can go ahead and describe the algorithm. — Trying, Nov 26 '13 at 01:07

xmoex · Answer 1 · 2013-10-14T10:07:18.277

I tried to write some simple c code to solve the problem:

Update:

I wrote a search function that looks for the required characters in the correct order, returning the length of the window and storing the window's start point in int * startAt. The function processes a sub-sequence of the given hay from the specified start point int start to its end.

The rest of the algorithm is located in main, where all possible subsequences are tested with a small optimisation: we start looking for the next window right after the start point of the previous one, so we skip some unnecessary turns. During the process we keep track of the best solution so far.

Complexity is O(n*n/2), i.e. O(n^2).

Update2:

Unnecessary dependencies have been removed, and unnecessary repeated calls to strlen(...) have been replaced by size parameters passed to search(...).

#include <stdio.h>

// search for single occurrence
int search(const char hay[], int haySize, const char needle[], int needleSize, int start, int * startAt)
{
    int i, charFound = 0;

    // search from start to end
    for (i = start; i < haySize; i++)
    {
        // found a character ?
        if (hay[i] == needle[charFound])
        {               
            // is it the first one?
            if (charFound == 0) 
                *startAt = i;   // store starting position
            charFound++;    // and go to next one
        }
        // are we done?
        if (charFound == needleSize)
            return i - *startAt + 1;    // success
    }
    return -1;  // failure
}

int main(int argc, char **argv)
{

    char hay[] = "ADCBDABCDACD";
    char needle[] = "ACD";

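    // resultStartAt starts at -1 so a failed search never prints an uninitialised value
    int resultStartAt = -1, resultLength = -1, i, haySize = sizeof(hay) - 1, needleSize = sizeof(needle) - 1;

    // search all possible occurrences (<= so that a window at the very end is also tried)
    for (i = 0; i <= haySize - needleSize; i++)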
    {
        int startAt, length;

        length = search(hay, haySize, needle, needleSize, i, &startAt);

        // found something?
        if (length != -1)
        {
            // check if it's the first result, or a one better than before
            if ((resultLength == -1) || (resultLength > length))
            {
                resultLength = length;
                resultStartAt = startAt;
            }
            // skip unnecessary steps in the next turn
            i = startAt;
        }
    }

    printf("start at: %d, length: %d\n", resultStartAt, resultLength);

    return 0;
}

@devsda: i don't think down voting should be used since you didn't like the answer. Code can be given to people who seek help. — Harikrishnan, Oct 07 '13 at 10:40
@XMOEX I upvoted your answer. Actually, whenever you post code, you should first describe your algorithm along with its time complexity. That is the right practice. — devsda, Oct 07 '13 at 11:45

Abhishek Bansal · Answer 2 · 2013-10-06T11:50:43.890

Start from the beginning of the string.

If you encounter an A, then mark the position and push it on a stack. After that, keep checking the characters sequentially:
1. If you encounter an A, update the A's position to the current value.
2. If you encounter a C, push it onto the stack.

After you encounter a C, again keep checking the characters sequentially:
1. If you encounter a D, erase the stack containing A and C and mark the score from A to D for this sub-sequence.
2. If you encounter an A, then start another stack and mark this position as well.
2a. If you now encounter a C, then erase the earlier stack and keep the most recent one.
2b. If you encounter a D, then erase the older stack, mark the score and check whether it is less than the current best score.

Keep doing this till you reach the end of the string.

The pseudo code can be something like:

Initialize stack = empty;
Initialize bestLength = mainString.size() + 1; // a large value for the subsequence.
Initialize currentLength = 0;
for ( int i = 0; i < mainString.size(); i++ ) {

  if ( stack is empty ) {
    if ( mainString[i] == 'A' ) {
      start a new stack and push A on it.
      mark the startPosition for this stack as i.
    }
    continue;
  }

  For each of the stacks ( there can be at most two stacks prevailing, 
                           one of size 1 and other of size 2 ) {
    if ( stack size == 1 ) { // only A in it
      if ( mainString[i] == 'A' ) {
        update the startPosition for this stack as i.
      }
      if ( mainString[i] == 'C' ) {
        push C on to this stack.
      }
    } else if ( stack size == 2 ) { // A & C in it
      if ( mainString[i] == 'A' and there is no stack with size 1 ) {
        start another stack, push A on it and mark its startPosition as i.
      }
      if ( mainString[i] == 'C' ) {
        if there is a stack with size 1, then delete this stack; // the other one dominates this stack.
      }
      if ( mainString[i] == 'D' ) {
        mark the score from startPosition till i and update bestLength accordingly.
        delete this stack.
      }
    }

  }

}
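The same idea can be generalised beyond A, C, D by keeping one "latest start" per prefix of the search string instead of explicit stacks. The following Python is only a sketch of that generalisation under my own naming (`shortest_ordered_window` and `latest` are not from the answer); it returns 0-based indexes.

```python
def shortest_ordered_window(hay, needle):
    """Return 0-based (start, end) of the shortest window of hay that
    contains needle as a subsequence, or None if there is none."""
    m = len(needle)
    if m == 0:
        return None
    # latest[j] = latest start index of a partial match of needle[:j+1]
    # seen so far; -1 means that prefix has not been matched yet.
    latest = [-1] * m
    best = None
    for i, ch in enumerate(hay):
        # Walk the needle right to left so that one hay character
        # extends each prefix at most once (matters for repeated letters).
        for j in range(m - 1, -1, -1):
            if ch != needle[j]:
                continue
            if j == 0:
                latest[0] = i                 # a newer start dominates older ones
            elif latest[j - 1] != -1:
                latest[j] = latest[j - 1]     # extend the best shorter prefix
            if j == m - 1 and latest[j] != -1:
                if best is None or i - latest[j] < best[1] - best[0]:
                    best = (latest[j], i)
    return best


print(shortest_ordered_window("ADCBDABCDACD", "ACD"))  # (9, 11)
```

Each character of the main string is compared against every position of the search string once, so this sketch does O(n*m) work with O(m) extra memory.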

Excellent, exactly what I was thinking. Am I correct in calling this an O(n*m) solution? — Jongware, Oct 06 '13 at 11:26
Sorry, the description is a bit vague for me to assess either its correctness or its complexity. Could you translate it to (pseudo)code? — nickie, Oct 06 '13 at 11:34
@nickie: since the search string is in the correct order, you are only interested in the *last* A that is followed by C and D. So while you are scanning for Cs, you can update the "last seen" A position. Similarly, until a D is found, you want to keep track of "A-then-C" positions -- it's useless to only update the C if there is no A before. — Jongware, Oct 06 '13 at 11:40

Ron Teller · Answer 3 · 2013-10-06T11:41:03.027

I modified my previous suggestion to use a single queue; I now believe this algorithm runs in O(N*m) time:

FindSequence(char[] sequenceList)
{
    queue startSeqQueue;
    int i = 0, k;
    int minSequenceLength = sequenceList.length + 1;
    int startIdx = -1, endIdx = -1;

    // collect the positions of every 'A', in increasing order
    for (i = 0; i < sequenceList.length - 2; i++)
    {
        if (sequenceList[i] == 'A')
        {
            startSeqQueue.enqueue(i);
        }
    }

    while (!startSeqQueue.isEmpty())
    {
        i = startSeqQueue.dequeue();
        k = i + 1;

        while (k < sequenceList.length && sequenceList[k] != 'C')
        {
            // a later 'A' seen before any 'C' gives a shorter window
            if (sequenceList[k] == 'A' && !startSeqQueue.isEmpty()) i = startSeqQueue.dequeue();
            k++;
        }

        while (k < sequenceList.length && sequenceList[k] != 'D')
            k++;

        if (k < sequenceList.length && k - i + 1 < minSequenceLength)
        {
            startIdx = i;
            endIdx = k;
            minSequenceLength = k - i + 1;
        }
    }

    return (startIdx, endIdx);
}

My previous (O(1) memory) suggestion:

FindSequence(char[] sequenceList)
{
    int i = 0, k;
    int minSequenceLength = sequenceList.length + 1;
    int startIdx = -1, endIdx = -1;

    for (i = 0; i < sequenceList.length - 2; i++)
    {
        if (sequenceList[i] == 'A')
        {
            k = i + 1;
            while (k < sequenceList.length && sequenceList[k] != 'C')
                k++;
            while (k < sequenceList.length && sequenceList[k] != 'D')
                k++;

            // k stopped on a 'D' inside the array: compare this window with the best so far
            if (k < sequenceList.length && k - i + 1 < minSequenceLength)
            {
                startIdx = i;
                endIdx = k;
                minSequenceLength = k - i + 1;
            }
        }
    }

    return (startIdx, endIdx);
}

@xmoex: this easily generalizes to any substring, with the same time and space complexity. — nickie, Oct 06 '13 at 10:58
You can generalize it without a fixed number of characters for the pattern. But I'd be more happy with a faster algorithm, say O(n*m) where m is the length of the pattern. — nickie, Oct 06 '13 at 11:03

score 0 · Answer 4 · answered Oct 06 '13 at 12:16

Here's my version. It keeps track of possible candidates for an optimum solution. For each character in the hay, it checks whether this character is the next one in sequence for each candidate. It then selects the shortest candidate. Quite straightforward.

using System.Collections.Generic;

class ShortestSequenceFinder
{
    public class Solution
    {
        public int StartIndex;
        public int Length;
    }

    private class Candidate
    {
        public int StartIndex;
        public int SearchIndex;
    }

    public Solution Execute(string hay, string needle)
    {
        var candidates = new List<Candidate>();
        var result = new Solution() { Length = hay.Length + 1 };
        for (int i = 0; i < hay.Length; i++)
        {
            char c = hay[i];
            for (int j = candidates.Count - 1; j >= 0; j--)
            {
                if (c == needle[candidates[j].SearchIndex])
                {
                    if (candidates[j].SearchIndex == needle.Length - 1)
                    {
                        int candidateLength = i - candidates[j].StartIndex + 1; // inclusive window length
                        if (candidateLength < result.Length)
                        {
                            result.Length = candidateLength;
                            result.StartIndex = candidates[j].StartIndex;
                        }
                        candidates.RemoveAt(j);
                    }
                    else
                    {
                        candidates[j].SearchIndex += 1;
                    }
                }
            }
            if (c == needle[0])
                candidates.Add(new Candidate { SearchIndex = 1, StartIndex = i });
        }
        return result;
    }
}

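It runs in O(n*m) as long as the number of live candidates stays around m. Every occurrence of needle[0] adds a candidate that is only removed once it completes, so in the worst case the list can grow to O(n) entries.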

You could have used `if((c == needle[candidates[j].SearchIndex]) && (candidates[j].SearchIndex == needle.Length - 1))` — Harikrishnan, Oct 07 '13 at 10:46
I don't think so. If the current char (`c`) matches the current needle position, I will advance the current needle position of the candidate being examined. However, if the candidate reached the end of the needle, I have to compare this candidate with the current best solution. I cannot combine the two conditions. — alzaimar, Oct 07 '13 at 14:12

Shashank · Answer 5 · 2013-10-07T16:20:54.753

Here is my solution in Python. It returns the indexes assuming 0-indexed sequences. Therefore, for the given example it returns (9, 11) instead of (10, 12). Obviously it's easy to mutate this to return (10, 12) if you wish.

def solution(s, ss):
    S, E = [], []
    for i in range(len(s)):
        if s[i] == ss[0]:
            S.append(i)
        if s[i] == ss[-1]:
            E.append(i)
    candidates = sorted([(start, end) for start in S for end in E
                        if start <= end and end - start >= len(ss) - 1],
                        key=lambda c: c[1] - c[0])
    for cand in candidates:
        i, j = cand[0], 0
        while i <= cand[-1]:
            if s[i] == ss[j]:
                j += 1
            i += 1
        if j == len(ss):
            return cand

Usage:

>>> from so import solution
>>> s = 'ADCBDABCDACD'
>>> solution(s, 'ACD')
(9, 11)
>>> solution(s, 'ADC')
(0, 2)
>>> solution(s, 'DCCD')
(1, 8)
>>> solution(s, s)
(0, 11)
>>> s = 'ABC'
>>> solution(s, 'B')
(1, 1)
>>> print(solution(s, 'gibberish'))
None

I think the time complexity is O(p log(p)), where p is the number of pairs of indexes in the sequence that refer to search_sequence[0] and search_sequence[-1], where the index for search_sequence[0] is less than the index for search_sequence[-1], because it sorts these p pairings using an O(n log n) algorithm. But then again, my substring iteration at the end could totally overshadow that sorting step. I'm not really sure.

It probably has a worst-case time complexity which is bounded by O(n*m) where n is the length of the sequence and m is the length of the search sequence, but at the moment I cannot think of an example worst-case.

score 0 · Answer 6 · answered Oct 08 '13 at 08:57

Here is my O(m*n) algorithm in Java:

import com.google.common.collect.ArrayListMultimap;
import com.google.common.collect.Multimap;

class ShortestWindowAlgorithm {

    Multimap<Character, Integer> charToNeedleIdx; // Character -> indexes in needle, from rightmost to leftmost | Multimap is a class from Guava
    int[] prefixesIdx; // prefixesIdx[i] -- rightmost index in the hay window that contains the shortest found prefix of needle[0..i]
    int[] prefixesLengths; // prefixesLengths[i] -- shortest window containing needle[0..i]

    public int shortestWindow(String hay, String needle) {
        init(needle);
        for (int i = 0; i < hay.length(); i++) {
            for (int needleIdx : charToNeedleIdx.get(hay.charAt(i))) {
                if (firstTimeAchievedPrefix(needleIdx) || foundShorterPrefix(needleIdx, i)) {
                    prefixesIdx[needleIdx] = i;
                    prefixesLengths[needleIdx] = getPrefixNewLength(needleIdx, i);
                    forgetOldPrefixes(needleIdx);
                }
            }
        }
        return prefixesLengths[prefixesLengths.length - 1];
    }

    private void init(String needle) {
        charToNeedleIdx = ArrayListMultimap.create();
        prefixesIdx = new int[needle.length()];
        prefixesLengths = new int[needle.length()];
        for (int i = needle.length() - 1; i >= 0; i--) {
            charToNeedleIdx.put(needle.charAt(i), i);
            prefixesIdx[i] = -1;
            prefixesLengths[i] = -1;
        }
    }

    private boolean firstTimeAchievedPrefix(int needleIdx) {
        int shortestPrefixSoFar = prefixesLengths[needleIdx];
        return shortestPrefixSoFar == -1 && (needleIdx == 0 || prefixesLengths[needleIdx - 1] != -1);
    }

    private boolean foundShorterPrefix(int needleIdx, int hayIdx) {
        int shortestPrefixSoFar = prefixesLengths[needleIdx];
        int newLength = getPrefixNewLength(needleIdx, hayIdx);
        return newLength <= shortestPrefixSoFar;
    }

    private int getPrefixNewLength(int needleIdx, int hayIdx) {
        return needleIdx == 0 ? 1 : (prefixesLengths[needleIdx - 1] + (hayIdx - prefixesIdx[needleIdx - 1]));
    }

    private void forgetOldPrefixes(int needleIdx) {
        if (needleIdx > 0) {
            prefixesLengths[needleIdx - 1] = -1;
            prefixesIdx[needleIdx - 1] = -1;
        }
    }
}

It works on every input and also can handle repeated characters etc.

Here are some examples:

public class StackOverflow {

    public static void main(String[] args) {
        ShortestWindowAlgorithm algorithm = new ShortestWindowAlgorithm();
        System.out.println(algorithm.shortestWindow("AXCXXCAXCXAXCXCXAXAXCXCXDXDXDXAXCXDXAXAXCD", "AACD")); // 6
        System.out.println(algorithm.shortestWindow("ADCBDABCDACD", "ACD")); // 3
        System.out.println(algorithm.shortestWindow("ADCBDABCD", "ACD")); // 4
    }
}

score 0 · Answer 7 · answered Oct 14 '13 at 14:05

I haven't read every answer here, but I don't think anyone has noticed that this is just a restricted version of local pairwise sequence alignment, in which we are only allowed to insert characters (and not delete or substitute them). As such it will be solved by a simplification of the Smith-Waterman algorithm that considers only 2 cases per vertex (arriving at the vertex either by matching a character exactly, or by inserting a character) rather than 3 cases. This algorithm is O(n^2).
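To make the alignment view concrete, here is a minimal Python sketch of such a two-case table. It is my own reading of the idea rather than a reference Smith-Waterman implementation, and the names `shortest_window_alignment` and `cost` are invented for the example. Indexes are 0-based.

```python
def shortest_window_alignment(hay, needle):
    """Return 0-based (start, end) of the shortest window, or None."""
    n, m = len(hay), len(needle)
    if m == 0:
        return None
    INF = float("inf")
    # cost[i][j] = length of the shortest window of hay that ends right
    # before position i and contains needle[:j] as a subsequence.
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        cost[i][0] = 0  # local alignment: a window may start anywhere for free
    best = None
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Case 1: hay[i-1] is an inserted (skipped) character.
            c = cost[i - 1][j] + 1
            # Case 2: hay[i-1] matches needle[j-1] exactly.
            if hay[i - 1] == needle[j - 1]:
                c = min(c, cost[i - 1][j - 1] + 1)
            cost[i][j] = c
        length = cost[i][m]
        if length != INF and (best is None or length < best[1] - best[0] + 1):
            best = (i - length, i - 1)
    return best


print(shortest_window_alignment("ADCBDABCDACD", "ACD"))  # (9, 11)
```

The table has (n+1)*(m+1) cells and each cell is filled in constant time, so this sketch does O(n*m) work, which is at most O(n^2) when the search string is no longer than the main string.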

seeker · Answer 8 · 2013-11-12T00:26:33.553

Here's my solution. It follows one of the pattern matching solutions. Please comment/correct me if I'm wrong.

Given the input string as in the question A D C B D A B C D A C D. Let's first compute the indices where A occurs. Assuming a zero based index this should be [0,5,9].

Now the pseudo code is as follows.

    Store the indices of A in a list, say *orders*. // orders=[0,5,9]
    globalMinStart=-1, globalMinEnd=-1;
    for (index: orders)
    {
       int localMinStart=index, localMinEnd=-1;
       Stack st=new Stack(); // to store the characters
       int i=index+1;
       while(i < length of input string)
       {
           if(str.charAt(i)=='C') { // we've already seen A, so we look for C
               st.push(str.charAt(i));
           }
           else if(str.charAt(i)=='D' and st is not empty and st.peek()=='C') {
               localMinEnd=i; // we have a match! so record the end index
               break;
           }
           else if(str.charAt(i)=='A') { // seen the next A
               break;
           }
           i++;
       }
       // keep the shorter of the best window so far and the one just found
       if (localMinEnd != -1 and (globalMinEnd == -1 or localMinEnd-localMinStart < globalMinEnd-globalMinStart))
       {
           globalMinEnd=localMinEnd;
           globalMinStart=localMinStart;
       }
    }

    return [globalMinStart,globalMinEnd]

P.S.: this is pseudocode and a rough idea. I'd be happy to correct it and to understand if there's something wrong.

AFAICT, time complexity is O(n) and space complexity is O(n).
