Longest common substring constrained by pattern

Question

Problem:

I have 3 strings s1, s2, s3. Each contains garbage text on either side of a defining pattern in its centre: text1+number1. number1 increases by 2 in each string. I want to extract text1+number1.

I have already written code to find number1.

How would I extend an LCS function to get text1?

#include <iostream>
#include <string>

const std::string longestCommonSubstring(int, std::string const& s1, std::string const& s2, std::string const& s3);

int main(void) {
    std::string s1="hello 5", s2="bolo 7", s3="lo 9sdf";
    std::cout << "Trying to get \"lo 5\", actual result: \"" << longestCommonSubstring(5, s1, s2, s3) << '\"';
}

const std::string longestCommonSubstring(int must_include, std::string const& s1, std::string const& s2, std::string const& s3) {
    std::string longest;

    for(size_t start=0, length=1; start + length <= s1.size();) {
        std::string tmp = s1.substr(start, length);
        if (std::string::npos != s2.find(tmp) && std::string::npos != s3.find(tmp)) {
            tmp.swap(longest);
            ++length;
        } else ++start;
    }

    return longest;
}

Example:

From "hello 5", "bolo 7", "lo 9sdf" I would like to get "lo 5"

Code:

I have been able to write a simple LCS function (test case), but I am having trouble writing this modified one.

Start with the obvious LCS algorithm and just modify the relevant character by a `+2` and `+4`. — Kerrek SB, Nov 13 '11 at 18:03
You have identified the LCS. Now do you want to extract the _next_ character(s) from each string, check they are numeric, and incremented by 2 each time? The easiest way to do that depends on the data. For instance, what if the 3 strings were "hello 5" "helolo 7" and "helzlo 9sdf" - the LCS will be "hel", but the one you want is "lo". Is that kind of data possible? If it is, you need to modify your LCS to also parse out the numeric part, and validate it. If not, you might be able to keep your LCS algorithm, find the LCS in each string, and parse from there. — Nikki Locke, Nov 13 '11 at 18:35

score 1 · Answer 1 · edited Nov 14 '11 at 07:54

Let's say you're looking for a pattern *n, *n+2, *n+4, etc. And you have the following strings: s1="hello 1,bye 2,ciao 1", s2="hello 3,bye 4,ciao 2" and s3="hello 5,bye 6,ciao 5". Then the following will do:

// find all pattern sequences
N1 = findAllPatterns(s1, number);
for i = 2 to n:
  for item in Ni-1:
    for match in findAllPatterns(si, nextPattern(item)):
      Ni.add([item, (match, indexOf(match))]);

// for all pattern sequences identify the max common substring
maxCommonLength = 0;
for sequence in Nn:
  temp = findLCS(sequence);
  if (length(temp[0]) > maxCommonLength):
    maxCommonLength = length(temp[0]);
    result = temp;

return result;

The first part of the algorithm identifies sequences of (number, position) pairs such as [(1, 6), (3, 6), (5, 6)], [(1, 19), (3, 6), (5, 6)] and [(2, 12), (4, 12), (6, 12)].

The second part will identify: ["hello 1", "hello 3", "hello 5"] as the longest substrings matching the pattern.

The algorithm can be further optimized by combining the two parts and discarding early sequences that match the pattern but are suboptimal, but I preferred to present it in two parts for better clarity.

-- Edit fixed code block

Thanks, looks like what I am looking for. I will implement and test it soon, then [probably] mark as correct. — A T, Nov 14 '11 at 03:13
Couldn't get it to work, + with a triple nested for loop and recursion, it can't be that efficient. — A T, Nov 14 '11 at 13:30

score 0 · Answer 2 · answered Nov 13 '11 at 18:39

If you know number1 already, and you know these numbers all appear just once in their corresponding strings, then the following should work:

I'll call your strings s[0], s[1], etc. Set longest = INT_MAX. For each string s[i] (i >= 0):

Find where number1 + 2 * i occurs in s[i]. Suppose it occurs at position j.
If (i == 0) j0 = j; else
- for (k = 1; k <= j && k <= j0 && k <= longest && s[i][j - k] == s[0][j0 - k]; ++k) {}
- longest = k - 1; (the loop stops at the first k that fails, so only k - 1 characters matched)

At the end, longest will be the length of the longest substring that is common to all the strings and ends immediately before the number.

Basically we're just scanning backwards from the point where we find the number, looking for a mismatch with the corresponding character in your s1 (my s[0]), and keeping track of what the longest matching substring is so far in longest -- this can only stay the same or decrease with each new string we look at.

j_random_hacker

Thanks, but it seems too constrained, especially that for loop. Not sure how it would guarantee `j` (`number`)'s presence in the LCS, it seems to guarantee only that the LCS is less than or equal to the `j`... – A T Nov 14 '11 at 03:12
I don't understand. At the end, `j0` is where the LCS *ends* in `s1`, and it extends backwards `longest` characters. This range of characters doesn't include `number1` because that is different in each string, and may even change length (e.g. going from 9 to 11). What else do you need to know? The position where the LCS ends in any of the strings is just where the particular number is found in that string, and it extends backwards `longest` characters just as it does for `s1`. – j_random_hacker Nov 14 '11 at 03:47

score 0 · Answer 3 · answered Nov 13 '11 at 18:46

Rather than try to modify the internals of the LCS algorithm, you could take its output and find it in s1. From there, your number will be located at an offset of the length of the output plus 1.

phatfingers

But I have no guarantee that the longest-common-substring contains a number, or contains the text next to the number. – A T Nov 14 '11 at 03:08
Ah, thank you. Since you've already written the code to find number1 (and presumably the n+2 and n+4), could you not just check each prior character from that point for all three strings? – phatfingers Nov 14 '11 at 03:52

A T · Accepted Answer · 2011-11-14T18:09:10.100

Wrote my own solution:

#include <iostream>
#include <string>
#include <utility>

typedef std::pair<std::pair<std::string, std::string>, std::pair<std::pair<std::string, std::string>, std::pair<std::string, std::string>>> pairStringTrio;
typedef std::pair<std::string,std::pair<std::string,std::string>> stringPairString;

stringPairString longestCommonSubstring(const pairStringTrio&);
std::string strFindReplace(const std::string&, const std::string&, const std::string&);

int main(void) {
        std::string s1= "6 HUMAN ACTIONb", s2="8 HUMAN ACTIONd", s3="10 HUMAN ACTIONf";
        pairStringTrio result = std::make_pair(std::make_pair(s1, "6"), std::make_pair(std::make_pair(s2, "8"), std::make_pair(s3, "10")));

        stringPairString answer = longestCommonSubstring(result);
        std::cout << '\"' << answer.first << "\"\t\"" << answer.second.first << "\"\t\"" << answer.second.second << '\"';
}


stringPairString longestCommonSubstring(const pairStringTrio &foo) {
        std::string longest;

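        // Grow the window from length 1, as in the question's LCS loop. Starting at
        // size()-1 only ever tests a couple of windows and returns "" for "hello 5".
        for(size_t start=0, length=1; start + length <= foo.first.first.size();) {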
                std::string s1_tmp = foo.first.first.substr(start, length);
                std::string s2_tmp = strFindReplace(s1_tmp, foo.first.second, foo.second.first.second);
                std::string s3_tmp = strFindReplace(s1_tmp, foo.first.second, foo.second.second.second);

                if (std::string::npos != foo.second.first.first.find(s2_tmp) && std::string::npos != foo.second.second.first.find(s3_tmp)) {
                        s1_tmp.swap(longest);
                        ++length;
                } else ++start;
        }

        return std::make_pair(longest, std::make_pair(strFindReplace(longest, foo.first.second, foo.second.first.second), strFindReplace(longest, foo.first.second, foo.second.second.second)));
}

std::string strFindReplace(const std::string &original, const std::string& src, const std::string& dest) {
        std::string answer=original;
        if (src.empty()) return answer; // nothing to search for
        for(std::size_t pos = 0; (pos = answer.find(src, pos)) != answer.npos;) {
                answer.replace(pos, src.size(), dest);
                pos += dest.size(); // skip the replacement so it is never rescanned
        }
        return answer;
}
