Number of distinct substrings with given prefix and suffix

Question

Suppose I am given a string S.

I need to find the number of distinct substrings of S that have S1 as a prefix and S2 as a suffix.

The lengths of S, S1 and S2 can be large, up to about 10^5.

For example:

Suppose S is "abcdcd", S1 is "ab" and S2 is "cd".

The distinct substrings of "abcdcd" are: "a", "b", "c", "d", "ab", "bc", "cd", "dc", "abc", "bcd", "cdc", "dcd", "abcd", "bcdc", "cdcd", "abcdc", "bcdcd", "abcdcd". The total number of distinct substrings can easily be found using a suffix array. I am trying to extend the same idea to solve this question.

Out of these substrings, the substrings containing "ab" as prefix and "cd" as suffix are: "abcd", "abcdcd".

Thus the answer is 2.
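To make the expected result concrete, here is a minimal brute-force sketch in Python that collects the qualifying substrings in a set. The function name `count_distinct` is made up for illustration, and the approach is far too slow for inputs of length 10^5; it is only meant as a reference for checking faster solutions on small strings.

```python
def count_distinct(s: str, s1: str, s2: str) -> int:
    """Brute force: number of distinct substrings of s that start with s1
    and end with s2. Only suitable for short strings."""
    # A qualifying substring must be long enough to hold both s1 and s2.
    min_len = max(len(s1), len(s2))
    seen = set()
    for i in range(len(s)):
        if not s.startswith(s1, i):
            continue  # s1 does not occur at position i
        for j in range(i + min_len, len(s) + 1):
            # s[:j] ends with s2, and s2 lies inside s[i:j] because
            # j - i >= len(s2)
            if s.endswith(s2, 0, j):
                seen.add(s[i:j])
    return len(seen)


print(count_distinct("abcdcd", "ab", "cd"))  # 2
```

The set is what removes duplicates: the same substring reached from two different positions is stored only once.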

PS: I believe the solution uses a suffix array, but I am not sure how. Please help.

What is your exact question? What have you achieved so far? Any thoughts, algorithms or code snippets? – Boris Brodski, Mar 29 '14 at 23:57
Is this a more expanded duplicate of [this question](http://stackoverflow.com/questions/22738099/find-distinct-substrings-starting-with-the-substring-x-and-ending-with-y)? – G. Bach, Mar 30 '14 at 00:38

Deduplicator · Answer 1 · 2014-03-30T01:58:59.240

The solution is simple:

Build a list of all occurrences of the prefix.
Build a list of all occurrences of the suffix.
Count valid combinations.

Complexity: O(#S+#S1)+O(#S+#S2)+O(#found(S1)+#found(S2)), where #X is the length of X and #found(X) is the number of occurrences of X in S.
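The three steps above can be sketched in Python as follows. This is only an illustration under some assumptions: the helper names `find_all` and `count_pairs` are invented here, both patterns are assumed to be non-empty, and `str.find` stands in for a linear-time matcher such as KMP, so the worst-case bound quoted above is not guaranteed. Note that the function counts pairs of occurrences and does not merge equal substrings, which is the over-counting raised in the comments.

```python
def find_all(text: str, pattern: str) -> list[int]:
    """Start index of every (possibly overlapping) occurrence of pattern."""
    positions = []
    i = text.find(pattern)
    while i != -1:
        positions.append(i)
        i = text.find(pattern, i + 1)
    return positions


def count_pairs(s: str, s1: str, s2: str) -> int:
    """Count (prefix occurrence, suffix occurrence) pairs that delimit a
    substring starting with s1 and ending with s2. Duplicates are NOT merged."""
    starts = find_all(s, s1)
    ends = find_all(s, s2)
    total = 0
    k = 0  # prefix occurrences usable with the current suffix occurrence
    for j in ends:
        # A prefix at p works if p <= j and p + len(s1) <= j + len(s2).
        limit = min(j, j + len(s2) - len(s1))
        while k < len(starts) and starts[k] <= limit:
            k += 1
        total += k
    return total


print(count_pairs("abcdcd", "ab", "cd"))  # 2
print(count_pairs("abab", "a", "b"))      # 3
```

For "abab" with S1 = "a" and S2 = "b" the pair count is 3, while only two distinct substrings ("ab" and "abab") qualify, so an extra deduplication step is still needed to answer the question as asked.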

Alternatively, to avoid building those arrays:

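startpos, endpos, startcount, ret = -1, -1, 0, 0
# endpos: start of the current S2 occurrence
# startpos: start of the last S1 occurrence already counted
while endpos = find next occurrence of S2 after endpos
  # an S1 occurrence at p pairs with this S2 occurrence if the
  # substring p .. endpos+#S2-1 is at least max(#S1,#S2) long
  limit = min(endpos, endpos + #S2 - #S1)
  while next occurrence of S1 after startpos exists and starts at or before limit
    startpos = start of that occurrence
    startcount = startcount + 1
  ret = ret + startcount
return ret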

Complexity should now be O(2*#S+#S1+#S2). But I'm not sure...

edited Mar 30 '14 at 01:58

answered Mar 30 '14 at 00:27

Deduplicator


How are you avoiding the duplicates? I believe your method will do over-counting. – user3476953 Mar 30 '14 at 00:33
That's why you build both arrays. Then you do this: Set first stop marker to earliest stop. Loop over all starts from beginning. Move the stop marker until you have a valid substring. Add all possible substrings with current startstring to the total. – Deduplicator Mar 30 '14 at 00:36
I am sorry, but I am unable to understand your algorithm properly. Can you please add a short example to clarify things a bit? – user3476953 Mar 30 '14 at 00:45
How about this now? BTW: I used an efficient O(n+m) string matching algorithm as a building block. They exist, but I'm not quite sure just now how they work... – Deduplicator Mar 30 '14 at 00:56
I think it's not quite as simple. You need to count only those substrings that are at least as long as `max{|S1|, |S2|}`. For example, check what happens for `S = aaaaaaaaaa, S1 = aaa, S2 = aaa` – Niklas B. Mar 30 '14 at 01:56
OK, now S1 and S2 can be similar... I should not have described it this sloppily, even in pseudocode, sorry. BTW: no suffix array to be seen. – Deduplicator Mar 30 '14 at 02:00
The idea is sound, but the implementation is not quite so straightforward. We would probably need to keep two pointers into the string, one to the current endpoint of S2 and the other one to the rightmost possible corresponding startpoint of S1. Then we can use the counters you propose. Yes, suffix arrays are overkill here – Niklas B. Mar 30 '14 at 02:02
startpos and endpos are our pointers into the string. The rightmost possible startpoint is implicitly found by "startpos = find new embedding". Precomputing it slows everything down. This code should work as is. – Deduplicator Mar 30 '14 at 02:05
