
I am trying to find the maximal number of consecutive repetitions of a substring inside a string. Here are a few examples:

"AQMQMB" => QM (2x)
"AQMPQMB" => <nothing>
"AACABABCABCABCP" => A (2x), AB (2x), ABC (3x)

As you can see, I am only looking for consecutive repetitions, and this seems to be the problem: all compression algorithms I am aware of either don't care about consecutiveness (LZ*) or are too simple to handle consecutive patterns rather than single data items (RLE). I think suffix-tree-related algorithms are not useful either, for the same reason.

I think there are some bioinformatics algorithms that can do this. Does anyone know of such an algorithm?

Edit: In the second example there might be multiple possible sets of consecutive patterns (thanks to Eugen Rieck for pointing this out, see the comments below); however, in my use case any of these possibilities is acceptable.

Y.H.
  • I see a min length of 2 to qualify from your example, is there a max length also? This influences the memory needs. – Eugen Rieck Nov 28 '12 at 11:20
  • No, there are no minimum/maximum length constraints. I will modify the output of the examples, thanks for letting me know. – Y.H. Nov 28 '12 at 11:22
  • Are you sure about the results of the example? I would add 'CAB' and 'BCA' (2x each) to the result list! – Eugen Rieck Nov 28 '12 at 11:39
  • I think the string can have multiple combinations of different consecutive patterns, probably depending on how the algorithm works (for example, whether it uses a table of candidate matches or traverses a suffix tree). I am not quite sure yet, but I just want to indicate that any of those combinations is fine for my use case. – Y.H. Nov 28 '12 at 11:51

2 Answers


Suffix tree related algorithms are useful here.

One is described in Algorithms on Strings, Trees and Sequences by Dan Gusfield (Chapter 9.6). It combines a divide-and-conquer approach with suffix trees and has time complexity O(N log N + Z), where Z is the number of substring repetitions.

The same book describes a simpler O(N^2) algorithm for this problem, also using suffix trees.
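
For comparison, here is a minimal quadratic-time sketch that does not use suffix trees at all, so it is not the algorithm from the book. It tries every candidate period p and scans for runs of positions where s[i] equals s[i+p]; a run of length r >= p means the p-character unit at the start of the run repeats floor((r + p) / p) times back to back. The function name findConsecutiveRepeats and the result layout are made up for illustration, and only the leftmost unit of each run is reported (rotations such as BCA inside a run of ABC are skipped).

```php
<?php

/**
 * Report consecutive repetitions by scanning every candidate period.
 * Illustrative sketch: O(N^2) time, O(1) extra space besides the result.
 */
function findConsecutiveRepeats(string $s): array
{
    $n = strlen($s);
    $found = [];

    // A unit of length $p can only repeat if it fits at least twice.
    for ($p = 1; $p <= intdiv($n, 2); $p++) {
        $run = 0; // number of consecutive positions with $s[$i] === $s[$i + $p]

        // $i runs one step past the last comparable index so that a run
        // reaching the end of the string is still flushed.
        for ($i = 0; $i <= $n - $p; $i++) {
            if ($i < $n - $p && $s[$i] === $s[$i + $p]) {
                $run++;
                continue;
            }
            if ($run >= $p) {
                $start = $i - $run;
                $found[] = [
                    'unit'  => substr($s, $start, $p),
                    'start' => $start,
                    'count' => intdiv($run + $p, $p),
                ];
            }
            $run = 0;
        }
    }

    return $found;
}

foreach (findConsecutiveRepeats("AACABABCABCABCP") as $r) {
    echo $r['unit'], " (", $r['count'], "x) at offset ", $r['start'], "\n";
}
```

For the three strings in the question this reports QM (2x), nothing, and A (2x), AB (2x), ABC (3x) respectively.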

Evgeny Kluev

This is what I used for a similar problem:

<?php

$input="AACABABCABCABCP";

//Prepare index array (A..Z) - adapt to your character range
$idx=array();
for ($i="A"; strlen($i)==1; $i++) $idx[$i]=array();

//Prepare hits array
$hits=array();

//Loop
$len=strlen($input);
for ($i=0;$i<$len;$i++) {

    //Current character
    $current=$input[$i];

    //Cycle past occurrences of character
    foreach ($idx[$current] as $offset) {

        //Check if substring from past occurrence to now matches oncoming
        $matchlen=$i-$offset;
        $match=substr($input,$offset,$matchlen);
        if ($match===substr($input,$i,$matchlen)) {
            //match found - store it
            if (isset($hits[$match])) $hits[$match][]=$i;
            else $hits[$match]=array($offset,$i);
        }
    }

    //Store current character in index
    $idx[$current][]=$i;
}

print_r($hits);

?>

I suspect it runs in O(N*N/M) time, with N being the string length and M the size of the character range.

It outputs what I think are the correct answers for your example.

Edit:

This algorithm has the advantage of keeping valid scores while running, so it is usable for streams, as long as you can look ahead via some buffering. It pays for this with efficiency.

Edit 2:

Allowing a maximum length for repetition detection will decrease space and time usage: expelling past occurrences that are too "early" via something like if ($matchlen>MAX_MATCH_LEN) ... limits the index size and the string comparison length.
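
A rough sketch of how that bounded variant could look, wrapped in a function for convenience. The name findRepeatsBounded and the $maxLen parameter (standing in for MAX_MATCH_LEN) are placeholders, and the index is built lazily instead of being pre-filled for A..Z. Because offsets are appended in increasing order, the oldest entry is always at the front, so expelling is just a matter of shifting entries off the front of the list.

```php
<?php

/**
 * Same lookback idea as above, but occurrences further back than
 * $maxLen are expelled, so no repeated unit longer than $maxLen is found.
 * Illustrative sketch; names are placeholders.
 */
function findRepeatsBounded(string $input, int $maxLen): array
{
    $idx  = []; // character => recent offsets, oldest first
    $hits = []; // repeated substring => offsets where consecutive copies start
    $len  = strlen($input);

    for ($i = 0; $i < $len; $i++) {
        $c    = $input[$i];
        $past = $idx[$c] ?? [];

        // Expel occurrences that would need a match longer than $maxLen
        while ($past && $i - $past[0] > $maxLen) {
            array_shift($past);
        }

        foreach ($past as $offset) {
            // Does the substring from the past occurrence up to now repeat here?
            $matchLen = $i - $offset;
            $match    = substr($input, $offset, $matchLen);
            if ($match === substr($input, $i, $matchLen)) {
                if (isset($hits[$match])) {
                    $hits[$match][] = $i;
                } else {
                    $hits[$match] = [$offset, $i];
                }
            }
        }

        // Store current character in the (pruned) index
        $past[]  = $i;
        $idx[$c] = $past;
    }

    return $hits;
}

print_r(findRepeatsBounded("AACABABCABCABCP", 2));
```

With a limit of 2 the ABC repetition is no longer found, only A and AB; with a limit of 3 it shows up again. Each per-character list holds at most $maxLen entries, which is where the space saving comes from.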

Eugen Rieck
  • I think time complexity is O(N^3/M) - two loops and one string comparison. – Evgeny Kluev Nov 28 '12 at 12:02
  • While you are right that the string comparison adds complexity, I wouldn't think of it as O(N): it only ever compares substrings shorter than the current index, not the total length of the string. It adds O(matchlen), which is guaranteed to be well below O(N). – Eugen Rieck Nov 28 '12 at 12:08
  • Could you please provide more references to this algorithm? – Y.H. Nov 28 '12 at 15:15
  • There is no formal reference AFAIK; it started as a quick hack, inspired by string matching via lookback references. The idea is that every repetition of a pattern must start with the same character. So I keep an index of past occurrences (infinite, or finite as in edit 2), and for every character seen, the index gives me a list of possible starting points for a repetition. The hope is that these reference lists will be relatively short, thus saving comparisons. – Eugen Rieck Nov 28 '12 at 15:54