Algorithm to find minimum length in main string to find second string

Question

There is a one question of Algorithm.

Question is as follows:-

You are given a protein String consisting of characters, A, B, C, D. You have to find a minimum length sequence in that.

Example

0 1 2 3 4 5 6 7 8 9 10 11 12
A B A C D C A B C D  C  C  D  

String to find is : BCD 

This string is find between (StartPoint, EndPoint)
1, 4
7, 9
1, 12
7, 12

Minimum length is of 7, 9.

So the answer is 7, 9

My work,

We can solve this using Brute force approach in O(n^2).
We can find the first sequence present in main string by using DP, and my DP logic is as follows,

A = Main string
B = String to be find
DP = Dynamic programming function

n = A.size, m = B.size

Build an array of  DP[m+1][n+1]

DP[i][j], means If in A[0...i], B[0...j] is present or not.

This way we can find our first occurence of B in A. Now after this, I am stuck.

I need some hint from your side.

Please give me hint/guidance only, no code or implementation required.

*"Question is as follows"*? There's currently no question, just an assignment :/ — Zeta, Oct 07 '13 at 09:24
@Zeta This is not an assignment. I found this question on one of the online coding competition. I stuck to develop efficient approach. I just need a start. I am not asking for a complete implementation. — devsda, Oct 07 '13 at 09:29
@Dukeling There they are sharing codes. And already made codes ruin your thinking, thats why I am asking for a hint/start. — devsda, Oct 07 '13 at 09:39
In my experience, hints don't make for particularly good StackOverflow answers, and, to me, code, pseudo-code and a high-level approach mainly differs in terms of readability, not amount of ruin to one's thinking. — Bernhard Barker, Oct 07 '13 at 09:51
this may be of use: http://stackoverflow.com/questions/1369802/substring-algorithm, but may include some code — jev, Oct 07 '13 at 13:35
Already made codes don't ruin your thinking...>_> I read other people's code all the time and it does not hamper my ability to come up with my own solutions. In fact I think it helps it, since I can draw upon examples I have seen in the past. — Shashank, Oct 07 '13 at 15:25
@jev That is for an exact substring, not a variable length substring containing all the characters from the search string (in the same order?) (`BAAAAAACD` is a valid find when looking for `BCD`, it isn't with that question). — Bernhard Barker, Oct 07 '13 at 15:47
should we then expect the the result in the example output being 1,4 (for BACD), istead of 7,9? — jev, Oct 07 '13 at 18:13
@justhalf Yes, I normally posted questions here after putting my best efforts. — devsda, Oct 08 '13 at 05:51

score 0 · Answer 1 · answered Oct 07 '13 at 13:05

Your sample problem and its solution clearly suggests that the solution will always be a numeric pair containing position of first letter of substring and position of last letter of substring i.e.

If the substring is BCD, then solution will be position of B, position of D

Provided that rest of the substring (C in this case) falls in between the solution pair.

So, to give a hint, we can start by finding the positions of first letter of substring in main string and store those positions in an array. Similarly we can find the positions of last letter of substring and store them in an array. This will give us a set of probable solution set wherein each pair will comprise of one number from array 1 and one number from array 2 such that number from array 2 is greater than number from array 1. Now we might end up observing that there is no such pair, which means there is no solution i.e. substring does not exist in main string, or we might end up find one or more such pairs, that means there can be a solution. Now all that is left to do is find out if rest of the substring exists between the solution pairs or not. If there are more than one such pairs found at the end, then just higher number minus the lower number should resolve to the right solution. Hope this helps, as you mentioned you do not wish to know the entire answer, you are just looking for hint :)

score 0 · Answer 2 · edited May 23 '17 at 12:20

Based on the example, I'm assuming the search string needs to be found in the same order as given (i.e. ACB isn't a valid find for ABC).

General DP approach / hints:

The function we're trying to minimize is the distance so far, so this should be the value stored in each cell of your matrix.

For some position in the string and some position in the search string, we need to look back to all previous positions in the string for one position back in the search string. For all of these we need to add the distance to there and record the minimum.

To illustrate, assume a search string of A, B, C, D. Then for ABC in the search string and position i in the string, we need to look at positions 0 through i-1 for AB.

Given a string BACCD and a search string BCD, when looking at the last position of both, we'd have something like:

DP(BACCD, BCD) = min(4+DP(B, BC), 3+DP(BA, BC), 2+DP(BAC, BC), 1+DP(BACC, BC))

But DP(B, BC) and DP(BA, BC) are invalid since B and BA don't contain BC and, more specifically, don't end with a C (thus they can be assigned some arbitrary large value).

Once we get to the last character in the search string, the value would indicate we found the complete search string, ending at that position in the string, thus it should compared to the global minimum.

Optimization:

To get an O(m*n) rather than O(m*n^2) running time, it's worth noting that you can stop iterating backwards as soon as you see another of the current letter (because, any sequence up to that point is longer than the same sequence with only the last letter moved forward), i.e.:

Given a string ABCCD and a search string ABC, when checking the second C, we can stop as soon as we get to the first C (which is right away), since ABC is shorter than ABCC.

Side note:

I think one can do better than the DP approach, but if I were to suggest something else here, it would likely just be copied from / inspired by one of the answers to Find length of smallest window that contains all the characters of a string in another string.

Means, If `A = BABCBDD` , and `B = BCD` , then `DP(BABCB, BCD)` can be calculated by `taking minimum from DP(BABC, BC), DP(BABC, B)`. Here, DP is your minimum function. Is this the same approach, you are talking about ? If this is not your approach, give me your mathematical formula, please. — devsda, Oct 08 '13 at 11:41

Algorithm to find minimum length in main string to find second string

2 Answers2