7

Is there an efficient way to find duplicate substrings? Here, "duplicate" means that two identical substrings sit right next to each other without overlapping. For example, take the source string:

ABCDDEFGHFGH

'D' and 'FGH' are duplicated. 'F' appears twice in the sequence, but the two occurrences are not adjacent, so it does not count as a duplicate. The algorithm should therefore return ['D', 'FGH']. Is there an elegant algorithm for this, rather than the brute-force method?

maple
  • Can you explain what you mean by 'duplicate'? In your example a 'duplicate' is a substring that repeats itself right after its first occurrence, but 'duplicate' usually means a substring that recurs at any place; in your case the letter F would also be a duplicate. Please be specific so we can help you. – Developer Dec 22 '16 at 11:29
  • 1
    @Oriel.F I'm sorry for the confusion. Is it clear now? – maple Dec 22 '16 at 11:35
  • 1
    What's the right answer for `AAAA`? Perhaps it's `['A', 'AA']` as a set, but do we need to account for duplicate `A` appearing three times? – Gassa Dec 22 '16 at 11:58
  • As a continuation of Gassa's question: what is the answer for ABABABAB? Is it {'AB', 'ABAB'} or just ABAB? I mean, do we also consider substrings of a longer string? – Saeed Amiri Dec 22 '16 at 12:56

3 Answers

5

This is related to the longest repeated substring problem, which can be solved by building a suffix tree in Θ(n) time and space.
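
As a rough illustration of the suffix-based idea, the sketch below finds the longest repeated substring with a naively built suffix array. The name longestRepeatedSubstring is made up for this example, and the construction simply sorts the suffixes, so it is much slower than a real suffix tree. It also finds repeats anywhere in the string, so the adjacency requirement from the question would still need an extra check on the two start positions.

```javascript
// Sketch only: naive suffix array (sorting suffixes directly), not a linear-time suffix tree.
function longestRepeatedSubstring(s) {
  // Suffix start indices, sorted by the suffix they point to.
  // All suffixes differ in length, so the comparator never sees equal strings.
  const sa = Array.from({ length: s.length }, (_, i) => i)
    .sort((a, b) => (s.slice(a) < s.slice(b) ? -1 : 1));

  let best = "";
  for (let k = 1; k < sa.length; k++) {
    const a = sa[k - 1];
    const b = sa[k];
    // Longest common prefix of two neighbouring suffixes in sorted order.
    let len = 0;
    while (a + len < s.length && b + len < s.length && s[a + len] === s[b + len]) {
      len++;
    }
    if (len > best.length) best = s.slice(a, a + len);
  }
  return best;
}

console.log(longestRepeatedSubstring("ABCDDEFGHFGH")); // FGH
```

For the string from the question this reports FGH, which happens to be adjacent, but a string such as "FGHxFGH" would report FGH as well even though the two copies do not touch.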

Muhammad Faizan Uddin
  • Thanks for your answer. I'm sorry for not stating the question clearly. The two substrings are required to be next to each other without overlapping. Can this algorithm be extended to fulfil that requirement? – maple Dec 22 '16 at 11:37
  • @maple: yes; as a first try I recommend using a *suffix array*, which is easier to construct but has `O(n*log(n))` time complexity (i.e. slightly slower than a suffix tree). – Dmitry Bychenko Dec 22 '16 at 11:41
1

Not very efficient (suffix trees/arrays are better for very large strings), but here is a very short regular-expression solution (C#):

  using System;
  using System.Linq;
  using System.Text.RegularExpressions;

  string source = @"ABCDDEFGHFGH";

  string[] result = Regex
    .Matches(source, @"(.+)\1")
    .OfType<Match>()
    .Select(match => match.Groups[1].Value)
    .ToArray();

Explanation

(.+) - group of any (at least 1) characters
\1   - the same group (group #1) repeated 

Test

  Console.Write(string.Join(", ", result));     

Outcome

  D, FGH

In case of ambiguity, e.g. "AAAA", where both "AA" and "A" are valid answers, the pattern is greedy, so "AA" is returned.
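
The same pattern can be tried in JavaScript to see how greediness affects the ambiguous case. This is only a sketch: adjacentRepeats and its lazy parameter are names invented here, and the lazy variant (.+?)\1 is an addition that is not part of the C# solution above.

```javascript
// Sketch: collect group 1 of every match of (.+)\1, or of the lazy (.+?)\1.
function adjacentRepeats(source, lazy = false) {
  const pattern = lazy ? /(.+?)\1/g : /(.+)\1/g;
  // matchAll requires the g flag; m[1] is the captured (repeated) unit.
  return Array.from(source.matchAll(pattern), (m) => m[1]);
}

console.log(adjacentRepeats("ABCDDEFGHFGH"));   // [ 'D', 'FGH' ]
console.log(adjacentRepeats("AAAA"));           // [ 'AA' ]
console.log(adjacentRepeats("AAAA", true));     // [ 'A', 'A' ]
```

With the lazy quantifier the shortest repeated unit is preferred, so "AAAA" yields "A" twice instead of a single "AA".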

Dmitry Bychenko
1

Without using any regex, which might turn out to be very slow, I guess it's best to use two cursors running hand in hand. The algorithm should be clear from the JS code below.

function getNborDupes(s){
  var cl = 0,  // cursor left
      cr = 0,  // cursor right
      ts = "", // test string
     res = []; // result array
  while (cl < s.length){
    cr = cl;
    while (++cr < s.length){
      ts = s.slice(cl,cr);  // ts spans cl to cr (char @ cr excluded)

      // compare ts with the substring from cr to cr + ts.length (end excluded);
      // on a match, push it to the result, move both cursors just past the
      // second copy (cr + ts.length) and continue
      if (ts === s.slice(cr, cr + ts.length)) {
        res.push(ts);
        cl = cr += ts.length;
      }
    }
    cl++;
  }
  return res;
}

var str = "ABCDDEFGHFGH";
console.log(getNborDupes(str));

Throughout the whole process ts will take the following values.

A
AB
ABC
ABCD
ABCDD
ABCDDE
ABCDDEF
ABCDDEFG
ABCDDEFGH
ABCDDEFGHF
ABCDDEFGHFG
B
BC
BCD
BCDD
BCDDE
BCDDEF
BCDDEFG
BCDDEFGH
BCDDEFGHF
BCDDEFGHFG
C
CD
CDD
CDDE
CDDEF
CDDEFG
CDDEFGH
CDDEFGHF
CDDEFGHFG
D
E
EF
EFG
EFGH
EFGHF
EFGHFG
F
FG
FGH

The cl = cr += ts.length part decides whether the search restarts after the whole duplicated pair or right after its first copy. With the code as written, the input "ABABABAB" returns ["AB", "AB"], but if you change it to cr = cl += ts.length you should expect the result to be ["AB", "AB", "AB"].
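
The two restart policies can be put side by side in one function. The following is a restructured sketch, not the code above verbatim: neighborDupes and the skipBothCopies flag are hypothetical names, and the loop is written over the candidate length instead of over a second cursor, under the assumption that the shortest match at each position is wanted.

```javascript
// Sketch: at each position try lengths 1, 2, ... and take the first adjacent repeat.
// skipBothCopies = true  -> resume after the second copy (like cl = cr += ts.length)
// skipBothCopies = false -> resume after the first copy  (like cr = cl += ts.length)
function neighborDupes(s, skipBothCopies = true) {
  const res = [];
  let cl = 0;
  while (cl < s.length) {
    let matchedLen = 0;
    for (let len = 1; cl + 2 * len <= s.length; len++) {
      // Does the block starting at cl repeat immediately after itself?
      if (s.startsWith(s.slice(cl, cl + len), cl + len)) {
        matchedLen = len;
        break;
      }
    }
    if (matchedLen > 0) {
      res.push(s.slice(cl, cl + matchedLen));
      cl += skipBothCopies ? 2 * matchedLen : matchedLen;
    } else {
      cl++;
    }
  }
  return res;
}

console.log(neighborDupes("ABABABAB"));        // [ 'AB', 'AB' ]
console.log(neighborDupes("ABABABAB", false)); // [ 'AB', 'AB', 'AB' ]
```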

Redu
  • Thanks for the answer. It is inspiring, though the code might not be right. I tried "1ABC2ABC33" and the output contains only ['3']. It looks like "cr" is overloaded: it cannot be the end of the test string and the cursor into the substring at the same time. IMHO, the procedure requires three nested loops. – Morgan Cheng Sep 30 '19 at 02:53
  • @Morgan Cheng This algorithm finds neighboring dupes, so it is normal that it returns only 3. – Redu Oct 01 '19 at 18:02
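
For the ambiguous inputs raised in the question's comments ("AAAA", "ABABABAB"), one possible reading is to report every distinct substring that is immediately followed by a copy of itself. The exhaustive sketch below does that; allAdjacentRepeats is a made-up name, the "set of all distinct repeats" interpretation is an assumption, and the running time is cubic in the worst case.

```javascript
// Sketch: check every start position and every length, collect distinct repeats.
function allAdjacentRepeats(s) {
  const found = new Set();
  for (let i = 0; i < s.length; i++) {
    for (let len = 1; i + 2 * len <= s.length; len++) {
      const unit = s.slice(i, i + len);
      // An adjacent repeat: unit occurs again starting exactly where it ends.
      if (s.startsWith(unit, i + len)) found.add(unit);
    }
  }
  return [...found].sort();
}

console.log(allAdjacentRepeats("AAAA"));         // [ 'A', 'AA' ]
console.log(allAdjacentRepeats("ABABABAB"));     // [ 'AB', 'ABAB', 'BA' ]
console.log(allAdjacentRepeats("ABCDDEFGHFGH")); // [ 'D', 'FGH' ]
```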