Algorithm for finding all of the shared substrings of any length between 2 strings, and then counting occurrences in string 2?

Question

I've run into an unusual challenge and so far I'm unable to determine the most efficient algorithm to attack this.

Given the following 2 strings as an example, find all commonly shared substrings between the 2 strings of any length, and count the number of occurrences of all of those shared substrings in string 2. Your algorithm also needs to be able to compute shared substrings between files containing strings that are up to 100MB or more in size.

Example:

String 1: ABCDE512ABC361EG51D

String 2: ADE5AHDW4131EG1DG5C

Given these 2 strings this algorithm would find the following shared substrings: A,C,D,E,5,1,3,G,DE,E5,EG,G5,1D,DE5,1EG

Then, from these commonly shared substrings, we'd find how many occurrences there are of each of them in string 2.

A: 2 occurrences in string 2

C: 1 occurrence in string 2

D: 3 occurrences in string 2

etc..

The first approach I took to wrap my head around this problem was brute force: computing the common shared substrings with 2 nested for loops. It is obviously the least efficient option, but it was a quick and dirty way to see what the expected outputs should be on smaller test input, and to establish the slowest possible run time, which was around 2 minutes to compute all common shared substrings between 2 files containing ASCII strings of 50 KB each. Upping the size to 1 MB brought this to a screeching halt due to the massive number of nested iterations required.
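
For reference, the brute-force baseline described above might look roughly like the sketch below. It is not the asker's actual code; the function names `countOccurrences` and `sharedSubstringCounts` are made up for illustration, and the only shortcut is to stop extending a substring once it no longer occurs in string 2.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Counts (possibly overlapping) occurrences of `pattern` in `text`.
std::size_t countOccurrences(const std::string& text, const std::string& pattern) {
    std::size_t count = 0;
    for (std::size_t pos = text.find(pattern); pos != std::string::npos;
         pos = text.find(pattern, pos + 1)) {
        ++count;
    }
    return count;
}

// Brute force: try every substring of s1 and count it in s2.
// Only substrings that occur at least once in s2 are returned.
std::map<std::string, std::size_t> sharedSubstringCounts(const std::string& s1,
                                                         const std::string& s2) {
    std::map<std::string, std::size_t> result;
    for (std::size_t start = 0; start < s1.size(); ++start) {
        for (std::size_t len = 1; start + len <= s1.size(); ++len) {
            const std::string candidate = s1.substr(start, len);
            if (result.count(candidate) != 0) continue;  // already counted earlier
            const std::size_t n = countOccurrences(s2, candidate);
            if (n == 0) break;  // no longer substring from this start can match
            result[candidate] = n;
        }
    }
    return result;
}
```

On the example strings this reports 16 shared substrings (the 15 listed above plus "1E"), which is useful as a reference result when checking faster implementations on small inputs.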

The next approach was using trees, to see how much memory I could trade off to optimize compute time. This approach was much faster. The same two 50 KB files that took 2 minutes with the brute-force method were near instant. Running against 1 MB files was still very fast (seconds), but as I continued to test with larger and larger file sizes, I quickly began running into memory issues due to tree sizes.

Note: The string files will only ever contain ASCII characters!

Edit:

I'm escalating this a bit further, please see:

https://gist.github.com/braydo25/f7a9ce7ce7ad7c5fb11ec511887789bc

It can be any ASCII character. I will update the question to account for this. — Braydon Batungbacal, Nov 04 '16 at 22:45
Right, which is why trees grow unbounded. It's not an easy problem. Considering the length can be anything. — Braydon Batungbacal, Nov 04 '16 at 22:53
If you know all substrings of length `l` and their occurrences in both strings, then all substrings of length `l+1` can only be located at the same positions and be extended by the character right after the occurrence (in the reference string). This reduces your search space dramatically. You don't even have to search. You know where to look. — Nico Schertler, Nov 04 '16 at 23:14
@NicoSchertler - We don't know any substrings of any length to start. — Braydon Batungbacal, Nov 04 '16 at 23:28
Of course, you have to find all substrings of length 1 at the beginning. Once you have this, the rest can be found incrementally. — Nico Schertler, Nov 04 '16 at 23:31
Have you tried to use a suffix automaton? It usually consumes less memory than a tree (the asymptotics is the same, but the constant is smaller). — kraskevich, Nov 04 '16 at 23:37
Related: https://en.wikipedia.org/wiki/Longest_common_substring_problem — Robert Harvey, Nov 04 '16 at 23:38
So making some headway, it seems leveraging temporary files with a sliding window approach may be the way to go here. Any additional ideas would be very helpful. This has to have been done before. — Braydon Batungbacal, Nov 05 '16 at 02:29
(I would be more comfortable with slight edits: (_MB_ for megabytes, _two_ instead of 2,) _files containing one string each_ or _files containing strings, each string up to hundreds of MB_.) (I'd rather do without phrases like `up to [measure] or more`) — greybeard, Dec 15 '16 at 23:21
Another approach would seem to be to build a "tallying" suffix array of `s2` and search for substrings of `s1`. — greybeard, Dec 15 '16 at 23:42
@greybeard I agree, mb = milli bits instead of mega bytes (MB). Of course it's clear from the context, but it's still wrong. — maraca, Dec 16 '16 at 01:55
Is it possible to get degenerate input case where both strings are letter 'A' 100 000 000 times? — vguberinic, Dec 16 '16 at 19:33
It "is" possible, but highly improbable. For this case let's say it just won't happen. Expect the input to have a high level of entropy. — Braydon Batungbacal, Dec 16 '16 at 22:18
Can you share the files that you have been testing with, along with their expected results? — Tim, Dec 16 '16 at 23:39
I can't explain why but this looks a lot like genetic sequencer alignment work. — Iwillnotexist Idonotexist, Dec 17 '16 at 07:36
I like David Eisenstat's idea over at github to exclude from reporting `substrings that don't occur more often than the superstring` (or even _substrings that don't occur outside one reported superstring_ or _any reported superstring_). — greybeard, Dec 22 '16 at 00:58

Nico Schertler · Answer 1 · 2016-12-16T01:16:44.477

Here is some code illustrating the idea I presented in the comments above. Although it is runnable C++ code, it is more pseudo-code in the sense that the utilized data structures are surely not optimal but they allow a clear view on the algorithm.

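#include <cstddef>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct Occurrence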
{
    //The vectors contain indices to the first character of the occurrence in ...
    std::vector<size_t> s1;  // ... string 1 and ...
    std::vector<size_t> s2;  // ... string 2.
};

int main()
{
    //If you cannot load the entire strings in memory, a memory-mapped file might be
    //worth considering
    std::string s1 = "ABCDE512ABC361EG51D";
    std::string s2 = "ADE5AHDW4131EG1DG5C";

    //These vectors store the occurrences of substrings for the current and next length
    std::vector<Occurrence> occurrences, nextOccurrences;
    int length = 1;

    std::map<char, Occurrence> occurrenceMap;
    //Initialize occurrences
    for (size_t i = 0; i < s1.length(); ++i)
        occurrenceMap[s1[i]].s1.push_back(i);
    for (size_t i = 0; i < s2.length(); ++i)
        occurrenceMap[s2[i]].s2.push_back(i);

    for (auto& pair : occurrenceMap)
    {
        if (pair.second.s1.size() > 0 && pair.second.s2.size() > 0)
            occurrences.push_back(std::move(pair.second));
    }

    do
    {
        nextOccurrences.clear();

        std::cout << "Length " << length << std::endl;
        for(auto& o : occurrences)
        {
            std::cout << std::string(s1.c_str() + o.s1[0], length) << " occurred "
                      << o.s1.size() << " / " << o.s2.size() << " times." << std::endl;

            //Expand the occurrence
            occurrenceMap.clear();
            for (auto p : o.s1)
            {
                if (p + length < s1.length())
                    occurrenceMap[s1[p + length]].s1.push_back(p);
            }
            for (auto p : o.s2)
            {
                if (p + length < s2.length())
                    occurrenceMap[s2[p + length]].s2.push_back(p);
            }
            for (auto& pair : occurrenceMap)
            {
                if (pair.second.s1.size() > 0 && pair.second.s2.size() > 0)
                    nextOccurrences.push_back(std::move(pair.second));
            }
        }

        ++length;
        std::swap(occurrences, nextOccurrences);

    } while (!occurrences.empty());


    return 0;
}

Output:

Length 1
1 occurred 3 / 3 times.
3 occurred 1 / 1 times.
5 occurred 2 / 2 times.
A occurred 2 / 2 times.
C occurred 2 / 1 times.
D occurred 2 / 3 times.
E occurred 2 / 2 times.
G occurred 1 / 2 times.
Length 2
1D occurred 1 / 1 times.
1E occurred 1 / 1 times.
DE occurred 1 / 1 times.
E5 occurred 1 / 1 times.
EG occurred 1 / 1 times.
G5 occurred 1 / 1 times.
Length 3
1EG occurred 1 / 1 times.
DE5 occurred 1 / 1 times.

Memory use peaks during initialization, because there is an entry for every character of both input strings. If you know the approximate length of the strings, you can choose a more compact index data type than size_t. The amount of memory needed is on the order of the input size, so two 100 MB files should be no problem for common computers. After the initialization (more specifically, after the first iteration of the loop), most of this data is deleted because it is no longer needed.
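
To make the point about the index type concrete, here is a sketch of the same length-by-length expansion wrapped in a function that stores 32-bit positions. The names `Index`, `Occ`, and `countShared` are illustrative, and the code assumes each input is shorter than 2^32 bytes. Collecting the results in a `std::map` is only for demonstration; for large files you would stream the counts out as in the program above.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Index = std::uint32_t;  // assumption: each input is shorter than 2^32 bytes

struct Occ {
    std::vector<Index> s1;  // start positions in string 1
    std::vector<Index> s2;  // start positions in string 2
};

// Returns, for every shared substring, its number of occurrences in s2.
std::map<std::string, std::size_t> countShared(const std::string& s1,
                                               const std::string& s2) {
    std::map<std::string, std::size_t> counts;
    std::vector<Occ> current, next;

    // Length 1: bucket all positions by character.
    std::map<char, Occ> buckets;
    for (Index i = 0; i < s1.size(); ++i) buckets[s1[i]].s1.push_back(i);
    for (Index i = 0; i < s2.size(); ++i) buckets[s2[i]].s2.push_back(i);
    for (auto& b : buckets)
        if (!b.second.s1.empty() && !b.second.s2.empty())
            current.push_back(std::move(b.second));

    for (std::size_t length = 1; !current.empty(); ++length) {
        next.clear();
        for (const Occ& o : current) {
            counts[s1.substr(o.s1[0], length)] = o.s2.size();
            // Extend every occurrence by one character and regroup.
            buckets.clear();
            for (Index p : o.s1)
                if (p + length < s1.size()) buckets[s1[p + length]].s1.push_back(p);
            for (Index p : o.s2)
                if (p + length < s2.size()) buckets[s2[p + length]].s2.push_back(p);
            for (auto& b : buckets)
                if (!b.second.s1.empty() && !b.second.s2.empty())
                    next.push_back(std::move(b.second));
        }
        std::swap(current, next);
    }
    return counts;
}
```

With `std::uint32_t` each stored position takes 4 bytes instead of the 8 that `size_t` typically takes on 64-bit platforms, which halves the peak size of the position vectors.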

I applaud you, this definitely worked. Just ran comparisons against 2, 200mb string files. Thanks! — Braydon Batungbacal, Nov 05 '16 at 09:40
After taking some time to go through this, it does seem like there's a bug, especially with larger files. I'm seeing incorrect counts being outputted to the console occasionally. — Braydon Batungbacal, Nov 26 '16 at 11:07
The approach itself is clever, I've just gotten to wrapping my ahead around this. I'm curious where you got the inspiration from to write it this way? Links to any specific algorithms, problems or resources you referenced would be great. — Braydon Batungbacal, Nov 26 '16 at 11:08
From what I understand, in the expand occurrence section if occurenceMap[s1[p + length (let's say length = 0) ]]; is "A", and file1[p + length (lets say length = 1)] is "B" and within the s2 loop.. s2[p+length (length = 0)] is "C" and s2[p+length (length = 1)] is "B", this would incorrectly be considered a match on "CB", because the map would move the starting index of "CB" to the key "B", and since a matched occurrence is determined by if a given key in the map has values for S1 / S2, this would cause an invalid match it seems as S1 would also have a starting index in key "B" for "AB" — Braydon Batungbacal, Nov 26 '16 at 11:26
I don't claim that this algorithm is waterproof, but I don't see the problem you describe. The `for(auto& o : occurences)` loop loops over all duplicate occurrences in both strings, i.e. `o` represents any string (e.g. "ABC"). Note that `length` can never be 0. The `s1` loop then checks every character that expands "ABC" in `s1`, e.g. "ABCX" and "ABCY". The same for `s2`, let's say "ABCY" and ABCZ". The final loop just checks if there is the same expansion in both, i.e. "ABCY". The starting indices of the occurrences are not changed (`.push_back(p)`). Are you sure you get wrong results? — Nico Schertler, Nov 26 '16 at 13:21
For the other question, it was merely the observation that you do not need to recalculate everything from scratch but you could rather use some of the previous results. That's more a general concept instead of a specific algorithm. You might call it dynamic programming because the subproblems you solve overlap. But it is pretty far from the usual use cases of DP. — Nico Schertler, Nov 26 '16 at 13:24
@BraydonBatungbacal: `if occurrenceMap[s1[p+length - 1]]; is "A", and file1[p+length] is "B" and within the s2 loop …s2[p+length - 1] is "C" and s2[p+length] is "B", this would incorrectly be considered a match on "CB", because the map would [map key "B" to the starting index of "CB", …] this would cause an invalid match it seems as S1 would also have a starting index in key "B" for "AB"` - note that `occurrenceMap` is cleared for each `o : occurrences`, e.g. after "A" as well as before "C": at any one time, all the one character "extensions" in `occurrenceMap` are for one and the same prefix. — greybeard, Dec 16 '16 at 09:08
Sketchy assessment of complexity: This seems to invest constant time for each recurrence observed. If the task was to _identify_ each occurrence of a _shared substrings in string 2_ (output sensitive problem), this would be optimal. For _reporting counts_, the problematic case seems to be _string repeated immediately more than once_: take `abababab` in both strings - there are two occurrences of `ababab`, three of `abab` and four of `ab` for every occurrence of `(ab)⁴` in _string 2_, quadratic complexity here. — greybeard, Dec 17 '16 at 11:31

David Eisenstat · Accepted Answer · 2016-12-20T19:43:11.017

Here's a C implementation based on traversing the suffix array of the concatenation of the inputs, with the help of the longest common prefix array. You can replace the programming-contest-grade (O(n log^2 n)) suffix array implementation with a real one (O(n) or O(n log n)) for a large performance improvement. (EDIT: did this, with some other changes reflecting the asker's new requirements: https://github.com/eisenstatdavid/commonsub .)

#include <inttypes.h>
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

typedef int_fast32_t I32;

#define Constrain(expression) _Static_assert(expression, #expression)
Constrain(CHAR_BIT == 8);
#define InputMaxBytes 80000000
Constrain(InputMaxBytes <= (INT_LEAST32_MAX - 2) / 2);
#define MaxLen (2 * InputMaxBytes + 2)
Constrain(MaxLen <= INT_FAST32_MAX / 2);

static I32 Len;
static I32 Begin2;
static signed char Buf[MaxLen];
static int_least32_t SufArr[MaxLen];
static int_least32_t SufRank[MaxLen];
static int_least32_t NewRank[MaxLen];
static int_least32_t *const LongCommPre = NewRank;  // aliased to save space
static uint_least64_t Bitmap2[(MaxLen >> 6) + 1];
static int_least32_t SparseCount2[(MaxLen >> 6) + 1];
static int_least32_t *const Stack = SufRank;  // aliased to save space

static void Slurp(const char *filename) {
  FILE *stream = fopen(filename, "r");
  if (stream == NULL) goto fail;
  I32 n = fread(Buf + Len, sizeof *Buf, InputMaxBytes + 1, stream);
  if (ferror(stream)) goto fail;
  if (n > InputMaxBytes) {
    fprintf(stderr, "%s: file is too large; increase InputMaxBytes\n",
            filename);
    exit(EXIT_FAILURE);
  }
  for (I32 i = 0; i < n; i++) {
    if (Buf[Len + i] < 0) {
      fprintf(stderr,
              "%s: file contains non-ASCII byte at offset %" PRIdFAST32 "\n",
              filename, i);
      exit(EXIT_FAILURE);
    }
  }
  Len += n;
  if (fclose(stream) == EOF) goto fail;
  return;
fail:
  perror(filename);
  exit(EXIT_FAILURE);
}

static I32 Radix;

static int CompareRankPairs(const void *iPtr, const void *jPtr) {
  I32 i = *(const int_least32_t *)iPtr;
  I32 j = *(const int_least32_t *)jPtr;
  if (SufRank[i] < SufRank[j]) return -1;
  if (SufRank[i] > SufRank[j]) return 1;
  I32 iRank = i + Radix < Len ? SufRank[i + Radix] : -2;
  I32 jRank = j + Radix < Len ? SufRank[j + Radix] : -2;
  if (iRank < jRank) return -1;
  if (iRank > jRank) return 1;
  return 0;
}

static void BuildSuffixArray(void) {
  for (I32 i = 0; i < Len; i++) {
    SufArr[i] = i;
    SufRank[i] = Buf[i];
  }
  for (Radix = 1; true; Radix *= 2) {
    qsort(SufArr, Len, sizeof *SufArr, CompareRankPairs);
    NewRank[0] = 0;
    for (I32 i = 1; i < Len; i++) {
      NewRank[i] = CompareRankPairs(&SufArr[i - 1], &SufArr[i]) == 0
                       ? NewRank[i - 1]
                       : NewRank[i - 1] + 1;
    }
    for (I32 i = 0; i < Len; i++) {
      SufRank[SufArr[i]] = NewRank[i];
    }
    if (NewRank[Len - 1] == Len - 1) break;
  }

  I32 lenCommPre = 0;
  for (I32 i = 0; i < Len; i++) {
    if (SufRank[i] == Len - 1) {
      LongCommPre[SufRank[i]] = -1;
      continue;
    }
    while (Buf[i + lenCommPre] == Buf[SufArr[SufRank[i] + 1] + lenCommPre]) {
      lenCommPre++;
    }
    LongCommPre[SufRank[i]] = lenCommPre;
    if (lenCommPre > 0) lenCommPre--;
  }
}

static I32 PopCount(uint_fast64_t x) {
  I32 v = 0;
  while (x != 0) {
    x &= x - 1;
    v++;
  }
  return v;
}

static void BuildCumCount2(void) {
  for (I32 i = 0; i < Len; i++) {
    if (SufArr[i] >= Begin2) {
      Bitmap2[i >> 6] |= UINT64_C(1) << (i & 63);
      SparseCount2[i >> 6]++;
    }
  }
  for (I32 i = 0; i < (Len >> 6); i++) {
    SparseCount2[i + 1] += SparseCount2[i];
  }
}

static I32 CumCount2(I32 i) {
  return SparseCount2[i >> 6] - PopCount(Bitmap2[i >> 6] >> (i & 63));
}

static void FindCommonStrings(void) {
  I32 lenCommPre = -1;
  for (I32 i = 0; i < Len; i++) {
    while (lenCommPre > LongCommPre[i]) {
      I32 begin = Stack[lenCommPre];
      I32 end = i + 1;
      I32 count2 = CumCount2(end) - CumCount2(begin);
      if (count2 > 0 && count2 < end - begin && lenCommPre > 0) {
        printf("%" PRIdFAST32 "\t%.*s\n", count2, (int)lenCommPre,
               Buf + SufArr[begin]);
      }
      lenCommPre--;
    }
    while (lenCommPre < LongCommPre[i]) {
      lenCommPre++;
      Stack[lenCommPre] = i;
    }
  }
}

int main(int argc, char *argv[]) {
  if (argc != 3) {
    fputs("usage: commonsub needle haystack\n", stderr);
    exit(EXIT_FAILURE);
  }
  Len = 0;
  Slurp(argv[1]);
  Buf[Len] = -1;
  Len++;
  Begin2 = Len;
  Slurp(argv[2]);
  Buf[Len] = -2;  // sentinel
  BuildSuffixArray();
  if (false) {
    for (I32 i = 0; i < Len; i++) {
      printf("%" PRIdFAST32 "\t%" PRIdLEAST32 "\t%" PRIdLEAST32 "\t%.*s\n", i,
             SufArr[i], LongCommPre[i], (int)(Len - SufArr[i]),
             Buf + SufArr[i]);
    }
  }
  BuildCumCount2();
  FindCommonStrings();
}
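
For readers new to suffix arrays, the following C++ sketch shows a simpler relative of the idea: it builds a suffix array of string 2 only and counts a given substring with two binary searches, along the lines of greybeard's comment on the question. It is not the algorithm used in the program above (which traverses the suffix array of the concatenation together with the longest-common-prefix array), the construction is deliberately naive, and the names `buildSuffixArray` and `countWithSuffixArray` are made up for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Builds the suffix array of `text` by directly sorting suffix start positions.
// Worst case O(n^2 log n); only meant for small inputs.
std::vector<std::size_t> buildSuffixArray(const std::string& text) {
    std::vector<std::size_t> sa(text.size());
    std::iota(sa.begin(), sa.end(), std::size_t{0});
    std::sort(sa.begin(), sa.end(), [&](std::size_t a, std::size_t b) {
        return text.compare(a, std::string::npos, text, b, std::string::npos) < 0;
    });
    return sa;
}

// Counts occurrences of `pattern` in `text` using two binary searches.
// Every occurrence is a suffix that starts with `pattern`, and those
// suffixes form one contiguous range of the suffix array.
std::size_t countWithSuffixArray(const std::string& text,
                                 const std::vector<std::size_t>& sa,
                                 const std::string& pattern) {
    // Compare only the first pattern.size() characters of each suffix.
    auto lo = std::lower_bound(sa.begin(), sa.end(), pattern,
        [&](std::size_t suffix, const std::string& p) {
            return text.compare(suffix, p.size(), p) < 0;
        });
    auto hi = std::upper_bound(lo, sa.end(), pattern,
        [&](const std::string& p, std::size_t suffix) {
            return text.compare(suffix, p.size(), p) > 0;
        });
    return static_cast<std::size_t>(hi - lo);
}
```

Each query costs O(m log n) character comparisons for a pattern of length m, so the count never requires scanning string 2 again once the array is built.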

`(O(n log^2 n)) suffix array implementation` ouch - was just trying to figure that out. Still much faster _with "repetitive/powerstrings"_ (until benchmark suggests otherwise). — greybeard, Dec 18 '16 at 23:09
@greybeard Yep, but the right thing is to use a library that I don't have permission to post here. — David Eisenstat, Dec 18 '16 at 23:16
@DavidEisenstat Something like Ukkonen's algorithm for O(n)? — Braydon Batungbacal, Dec 19 '16 at 08:30
@BraydonBatungbacal See https://github.com/eisenstatdavid/commonsub for the code with the sais suffix array library. — David Eisenstat, Dec 20 '16 at 15:07

Francis Cugler · Answer 3 · 2016-12-17T16:06:32.023

After looking at the two strings and thinking about this for a bit, I worked through the procedure in my head; here it is translated into steps.

String 1: ABCDE512ABC361EG51D  // S1
String 2: ADE5AHDW4131EG1DG5C  // S2

The idea is to compare characters and/or substrings from S1 against S2 while keeping track of occurrences.

S1[0] = 'A'  compare S2[0]  = 'A' = true : found A in S2 at location 0
S1[0] = 'A'  compare S2[1]  = 'D' = false
S1[0] = 'A'  compare S2[2]  = 'E' = false
S1[0] = 'A'  compare S2[3]  = '5' = false
S1[0] = 'A'  compare S2[4]  = 'A' = true : found A in S2 at location 4
S1[0] = 'A'  compare S2[5]  = 'H' = false
S1[0] = 'A'  compare S2[6]  = 'D' = false
S1[0] = 'A'  compare S2[7]  = 'W' = false
S1[0] = 'A'  compare S2[8]  = '4' = false
S1[0] = 'A'  compare S2[9]  = '1' = false
S1[0] = 'A'  compare S2[10] = '3' = false
S1[0] = 'A'  compare S2[11] = '1' = false
S1[0] = 'A'  compare S2[12] = 'E' = false
S1[0] = 'A'  compare S2[13] = 'G' = false
S1[0] = 'A'  compare S2[14] = '1' = false
S1[0] = 'A'  compare S2[15] = 'D' = false
S1[0] = 'A'  compare S2[16] = 'G' = false
S1[0] = 'A'  compare S2[17] = '5' = false
S1[0] = 'A'  compare S2[18] = 'C' = false

// End of First Search - Occurrences of 'A' in S2 is 2 at locations {0,4}

// Next Iteration
String 1: ABCDE512ABC361EG51D  // S1
String 2: ADE5AHDW4131EG1DG5C  // S2

// Repeat this for all single characters Of S1 against S2
'A' in S2 = 2  at {0,4}
'B' in S2 = 0 
'C' in S2 = 1  at {18}
'D' in S2 = 3  at {1,6,15}
'E' in S2 = 2  at {2,12}
'5' in S2 = 2  at {3,17}
'1' in S2 = 3  at {9,11,14}
'2' in S2 = 0
'A' Already Found Above Skip
'B' Already Found Above Skip
'C' Already Found Above Skip
'3' in S2 = 1  at {10}
'6' in S2 = 0
'1' Already Found Above Skip
'E' Already Found Above Skip
'G' in S2 = 2  at {13, 16}
'5' Already Found Above Skip
'1' Already Found Above Skip
'D' Already Found Above Skip

This concludes the first set of iterations over all single characters. As you can see, we have also built a table (a list, map, or set) recording not only the occurrences but also their locations, which we can store for future reference.

So if we begin to search for S1[0 & 1] in S2, we know that S1[1] ('B') does not exist in S2, so we can break and don't need to go down that chain. Having broken out of that branch, we can also skip S1[1 & ... N] and move directly to S1[2]. We know there is only 1 occurrence of S1[2], which is 'C', in S2, located at {18}, the end of the string, so there is no need to look for S1[2 & ... N]; we skip it and move to S1[3], which is 'D'. We know that 'D' exists in S2 at {1,6,15}, so we search for S1[3 & ... N] starting at S2[1 & ... N], then at S2[6 & ... N], and finally at S2[15 & ... N]. At that point we have found all substrings that start with 'D' in S2 and can save their occurrences.

This is where we want to find the longest substring shared between the two. The longest substring is "DE5", and there is only one occurrence of it; from it we have also already found the substrings "DE" and "E5", so we can search for them at this point as well, and we find 1 occurrence of each. Then we just repeat this process. It will take a fairly long time at first, but the further you traverse the strings, the faster it goes, because occurrences that were already found are eliminated and substrings of S1 that do not occur in S2 are skipped.

This is the logical approach I took, without any code or programming semantics; it is just the basic algorithm worked through by hand. It now becomes a matter of putting this into functions and containers to write a source code implementation of it.

EDIT - In response to the comments asking how this differs from another answer and about the time & space complexity, here is a version of my algorithm that does the first pass: it searches for single characters and builds the table recording whether each exists in the 2nd string and at which positions. The vector stored in the class holds one entry for each unique character of S1, with its positions in S2. This can then be used to help find longer substrings.

// C++ - The user asked for this in C but I haven't used C in nearly 10 years so this is my version of it in C++ :( 
#include <iomanip>
#include <iostream>
#include <string>
#include <vector>

class SubStringSearch {
private:
    std::string S1;
    std::string S2; 

    struct SubstringResult {
        std::string substring;
        bool found;
        std::vector<unsigned> positions;

        SubstringResult(){}
        SubstringResult( const std::string& substringIn, bool foundIn, std::vector<unsigned> positionsIn ) :
            substring( substringIn ), found( foundIn ), positions( positionsIn ) {}
    };

    std::vector<SubstringResult> results;

public:
    SubStringSearch( const std::string& s1, const std::string& s2 ) : S1( s1 ), S2( s2 ) {}

    void compareStringsFirstPass();
    std::vector<unsigned> findLocations( const std::string& str, char findIt );
    void printResults() const;

};

std::vector<unsigned> SubStringSearch::findLocations( const std::string& str, char findIt ) {
    std::vector<unsigned> locations;
    for ( unsigned i = 0; i < str.size(); ++i ) {
        if ( str[i] == findIt ) {
            locations.push_back( i );
        }
    }
    return locations;
}

void SubStringSearch::compareStringsFirstPass() {
    std::vector<unsigned> positions;
    std::string sub;
    bool alreadyFound = false;

    for ( unsigned idx = 0; idx < S1.size(); ++idx ) {
        sub = S1[idx];

        if ( idx > 0 ) {
            for ( unsigned u = 0; u < results.size(); ++u ) {
                if ( sub == results[u].substring ) {
                    alreadyFound = true;
                    break;
                }
            }
        }

        // Added An If Else Here To Reduce Unneeded Calls To findLocations()
        if ( alreadyFound ) {
            alreadyFound = false;
            continue;
        } else {
            positions = findLocations( S2, S1[idx] );
        }

        if ( positions.size() > 0 && !alreadyFound ) {
            results.push_back( SubstringResult( sub, true, positions ) );
        } else if ( !alreadyFound ) {
            positions.clear();
            results.push_back( SubstringResult( sub, false, positions ) );
        }

        positions.clear();
        alreadyFound = false;
    }
}

void SubStringSearch::printResults() const {
    for ( unsigned u = 0; u < results.size(); ++u ) {
        if ( results[u].found ) {
            std::cout << results[u].substring << " found in S2 at " << std::setw(2);
            for ( unsigned i = 0; i < results[u].positions.size(); ++i ) {
                std::cout << std::setw(2) << results[u].positions[i] << " ";
            }
            std::cout << std::endl;
        }
    }
}

int main() {
    std::string S1( "ABCDE512ABC361EG51D" );
    std::string S2( "ADE5AHDW4131EG1DG5C" );

    SubStringSearch searchStrings( S1, S2 );
    searchStrings.compareStringsFirstPass();

    std::cout << "break point";

    return 0;
} // main

Place a breakpoint on that last print line and, in your debugger (the Locals or Autos window in MSVC, or the equivalent for your compiler / debugger), inspect the class's std::vector member variable. For each character from S1 you will see a bool flag saying whether it was found, along with a std::vector of its positions. If the flag is false, the vector's size should be 0; vice versa, if the vector's size is > 0, the flag should be true. The size of the positions vector is also the count of occurrences of that character in the 2nd string, which is nice because we don't have to calculate anything else; we can just read it from the vector itself.

Now this is not the complete or full algorithm as this is only the first pass of doing each single character of string 1 and looking into string 2 while building the needed table and skipping over contents that have already been found. It will be up to the OP to build upon this if they so choose to complete the rest of the algorithm. If I happen to find some free time in the near future I may go ahead and complete the full algorithm.
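
As a hint at how the next pass could use that table, here is a rough sketch (not the completed algorithm) that takes a start index in S1 together with the already known positions of that character in S2 and extends each candidate match as far as it goes. The helper names `matchLength` and `longestSharedFrom` are hypothetical and not part of the class above.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Length of the longest common prefix of s1[i..] and s2[j..].
std::size_t matchLength(const std::string& s1, std::size_t i,
                        const std::string& s2, std::size_t j) {
    std::size_t len = 0;
    while (i + len < s1.size() && j + len < s2.size() &&
           s1[i + len] == s2[j + len]) {
        ++len;
    }
    return len;
}

// Longest substring of s1 starting at index i that also occurs in s2.
// Only the positions where s1[i] is already known to occur in s2 are tried,
// so characters that were not found in the first pass cost nothing here.
std::string longestSharedFrom(const std::string& s1, std::size_t i,
                              const std::string& s2,
                              const std::vector<unsigned>& positions) {
    std::size_t best = 0;
    for (unsigned j : positions) {
        best = std::max(best, matchLength(s1, i, s2, j));
    }
    return s1.substr(i, best);
}
```

For the example, starting at S1[3] ('D') with the recorded positions {1, 6, 15} yields "DE5", matching the walkthrough above; an empty position list yields an empty string.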

(How is this different from the approach presented by [Nico Schertler](http://stackoverflow.com/a/40434453/3789665)? What is the worst case time & space complexity?) — greybeard, Dec 17 '16 at 10:44
@greybeard Nico gave his answer and showed working c++ code. I was just giving a general idea or an approach to break down such a problem in order to construct or to devise an algorithm. This is just the logical representation of it. — Francis Cugler, Dec 17 '16 at 12:24
@greybeard As for the time & space complexity it depends on the size of both strings. Of course the first pass to compare each character against each character is going to be [m^n] complexity where m has not already been found in n and where m is not in n. This gives the tables, matches and occurrences as well as their positions in the 2nd string. We can then use that table of sets to find all multi character substrings searching for the longest and for each one we find, we can skip over smaller subs strings of those longer ones. This will decrease the time & space complexity on each pass. — Francis Cugler, Dec 17 '16 at 12:53
@greybeard I added my version of the algorithm for the 1st pass to search individual characters of string S1 in string S2 and recorded if they were found and at which positions while removing unnecessary searches and function calls. Also by storing the locations of the finds; this also gives the amount of occurrences of those individual characters. I may at a future date build upon this to give the full implementation or the completed version of the algorithm. — Francis Cugler, Dec 17 '16 at 15:36

score 0 · Answer 4 · answered Dec 21 '16 at 20:34

From what I can understand, breaking up the string to all possible sub-strings is in itself an O(n*n) operation.

abcd
====
a,b,c,d
ab,bc,cd
abc,bcd
abcd
************************
abcdefgh
========
a,b,c,d,e,f,g,h
ab,bc,cd,de,ef,fg,gh
abc,bcd,cde,def,efg,fgh
abcd,bcde,cdef,defg,efgh
abcde,bcdef,cdefg,defgh
abcdef,bcdefg,cdefgh
abcdefg,bcdefgh
abcdefgh

As such, it doesn't look like a solution in linear time is possible.
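
The count behind that observation can be checked with a small sketch. The function names are illustrative; the first function enumerates every (start, length) pair the way the listing above does, and the second is the closed form n + (n-1) + ... + 1 = n(n+1)/2.

```cpp
#include <cstddef>
#include <string>

// Counts substrings (by position) by enumerating every (start, length) pair.
std::size_t enumerateSubstrings(const std::string& s) {
    std::size_t count = 0;
    for (std::size_t len = 1; len <= s.size(); ++len)
        for (std::size_t start = 0; start + len <= s.size(); ++start)
            ++count;
    return count;
}

// Closed form of the same sum: n + (n-1) + ... + 1.
std::size_t substringCount(std::size_t n) { return n * (n + 1) / 2; }
```

For "abcd" both give 10 and for "abcdefgh" both give 36, in line with the rows listed above. This only bounds how many substrings could be listed explicitly; it says nothing about methods that avoid materializing each substring.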

Furthermore, to actually solve it from a Java perspective, you'd first have to break the string up and store the substrings in a set or a map (a map can have the substring as key and the number of occurrences as value).

Then repeat the step for the second string as well.

Then you can iterate over the first, checking if the entry exists in the second string's map and also increment the number of occurrences for that sub-string in parallel.

If you are using 'C', then you can try sorting the array of sub-strings and then use binary search to find matches (while having a two-dimensional array to keep track of the string and the count of occurrences).
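
The map-based steps above could be sketched as follows, written in C++ rather than Java to match the other code on this page. The function names `tallySubstrings` and `sharedCounts` are illustrative. Because every substring is stored as its own string, memory can grow up to cubically with the input length, so this is only practical for small inputs.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Tallies every substring of `s` (by value) with its number of occurrences.
std::map<std::string, std::size_t> tallySubstrings(const std::string& s) {
    std::map<std::string, std::size_t> tally;
    for (std::size_t start = 0; start < s.size(); ++start)
        for (std::size_t len = 1; start + len <= s.size(); ++len)
            ++tally[s.substr(start, len)];
    return tally;
}

// Keeps the substrings of s1 that also appear in s2, with their s2 counts.
std::map<std::string, std::size_t> sharedCounts(const std::string& s1,
                                                const std::string& s2) {
    const auto tally1 = tallySubstrings(s1);
    const auto tally2 = tallySubstrings(s2);
    std::map<std::string, std::size_t> shared;
    for (const auto& entry : tally1) {
        const auto it = tally2.find(entry.first);
        if (it != tally2.end()) shared[entry.first] = it->second;
    }
    return shared;
}
```

Since `std::map` keeps its keys sorted, the lookup in the second map is already a binary-search-style operation, which is close in spirit to the sorted-array variant suggested for C.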

You said you had a tree approach that ran faster. Do you mind posting a sample showing how you used the tree? Was it for representing the substrings or for helping to generate them?
