Regex expression for matching all duplicate substrings of any length

Question

Let's say we have a string: "abcbcdcde"

I want to identify all substrings that are repeated in this string using regex (i.e. no brute-force iterative loops).

For the above string, the result set would be: {"b", "bc", "c", "cd", "d"}

I must confess that my regex is far more rusty than it should be for someone with my experience. I tried using a backreference, but that'll only match consecutive duplicates. I need to match all duplicates, consecutive or otherwise.

In other words, I want to match any character(s) that appears for the >= 2nd time. If a substring occurs 5 times, then I want to capture each of occurrences 2-5. Make sense?

This is my pathetic attempt thus far:

preg_match_all( '/(.+)(.*)\1+/', $string, $matches );  // Way off!

I tried playing with look-aheads but I'm just butchering it. I'm doing this in PHP (PCRE) but the problem is more or less language-agnostic. It's a bit embarrassing that I'm finding myself stumped on this.

And you're sure that this can be done with regular expressions? :) — Ja͢ck, Dec 14 '12 at 07:57
No, I'm not. In fact, I can't find any evidence that it can. I guess I was hoping that my lack of success was merely the product of me not being smart enough with regex lol; that someone more experienced with it would have the magic answer that I was missing. From the answers, though, it looks like it wasn't just me after all. — Kris Craig, Dec 14 '12 at 11:40
Let's say the string in question is 20,000 characters in length and consists only of letters. Brute-force iteration would be prohibitively slow even on good hardware. On the other hand, it looks like it can't be done with a magic regex pattern. So, what do you guys think would be the best way to accomplish this task in the shortest amount of execution time possible? I'm too tired to think anymore tonight but I'll be interested to see if anyone comes up with a better approach than the one I think of. =) — Kris Craig, Dec 14 '12 at 11:45
So are we looking at 20k chars or longer? And it would help to somewhat formalize the desired runtime / performance you're expecting. Oh, and would you need the frequencies as well? — Ja͢ck, Dec 14 '12 at 12:29
Basically, I need it to be as fast as possible, whatever that may be. — Kris Craig, Dec 16 '12 at 11:35
Here's my question: Since regex and backtracing are no good, can you think of any other possible way of doing this that doesn't involve manually looping through the entire string to identify and remove the duplicates? — Kris Craig, Dec 16 '12 at 11:37
I can't think of any optimization that can be applied in this case. — Ja͢ck, Dec 16 '12 at 12:01
Both Jack and Tim's answers were accurate. Tie-breaker went to Jack for going that extra mile with the perf data. =) — Kris Craig, Dec 22 '12 at 02:57
To summarize for anyone finding this on Google or whatever, what I was hoping to accomplish is apparently not possible. I was hoping it was and that I just wasn't seeing it, but no such luck. There's no way to do this without brute-force iterations, unfortunately. Well, at least we thoroughly established that lol. =) — Kris Craig, Dec 22 '12 at 02:59
And thanks to all of you who worked your brains trying to find that elusive magic solution! — Kris Craig, Dec 22 '12 at 03:01

score 9 · Accepted Answer · edited May 23 '17 at 11:43

Your problem is recursi ... you know what, forget about recursion! =p it wouldn't really work well in PHP and the algorithm is pretty clear without it as well.

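function find_repeating_sequences($s)
{
    $res = array();
    // test the length rather than truthiness, so a remaining "0" is still processed
    while (strlen($s) > 0) {
        $i = 1; $pat = $s[0];
        // look for another occurrence that starts after the current pattern
        while (false !== strpos($s, $pat, $i)) {
            $res[$pat] = 1;
            // expand pattern and try again
            $pat .= $s[$i++];
        }
        // move the string forward
        $s = substr($s, 1);
    }
    return array_keys($res);
}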

Out of interest, I wrote Tim's answer in PHP as well:

function find_repeating_sequences_re($s)
{
    $res = array();
    preg_match_all('/(?=(.+).*\1)/', $s, $matches);
    foreach ($matches[1] as $match) {
        $length = strlen($match);
        if ($length > 1) {
            for ($i = 0; $i < $length; ++$i) {
                for ($j = $i; $j < $length; ++$j) {
                    $res[substr($match, $i, $j - $i + 1)] = 1;
                }
            }
        } else {
            $res[$match] = 1;
        }
    }
    return array_keys($res);
}

I've let them fight it out in a small benchmark of 800 bytes of random data:

$data = base64_encode(openssl_random_pseudo_bytes(600));

Each function is run for 10 rounds and the total execution time is measured. The results?

Pure PHP      - 0.014s (10 runs)
PCRE          - 40.86s <-- ouch!

It gets weirder when you look at 24k bytes (or anything above 1k really):

Pure PHP      - 4.565s (10 runs)
PCRE          - 0.232s <-- WAT?!

It turns out that the regular expression broke down after 1k characters and so the $matches array was empty. These are my .ini settings:

pcre.backtrack_limit => 1000000 => 1000000
pcre.recursion_limit => 100000 => 100000

It's not clear to me how a backtrack or recursion limit would have been hit after only 1k characters. But even if those settings are "fixed" somehow, the results are still obvious: PCRE doesn't seem to be the answer.

I suppose writing this in C would speed it up somewhat, but I'm not sure to what degree.

Update

With some help from hakre's answer I put together an improved version that increases performance by ~18% after optimizing the following:

Remove the substr() calls in the outer loop to advance the string pointer; this was a leftover from my previous recursive incarnations.
Use the partial results as a positive cache to skip strpos() calls inside the inner loop.

And here it is, in all its glory (:

function find_repeating_sequences3($s)
{
    $res = array(); 
    $p   = 0;
    $len = strlen($s);

    while ($p != $len) {
        $pat = $s[$p]; $i = ++$p;
        while ($i != $len) {
            if (!isset($res[$pat])) {
                if (false === strpos($s, $pat, $i)) {
                    break;
                }
                $res[$pat] = 1;
            }
            // expand pattern and try again
            $pat .= $s[$i++];
        }
    }
    return array_keys($res);
}
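
For readers who want to try the same idea outside PHP, here is a rough Python sketch of the cached scan above. It is a port made for illustration, not code from this answer: the names `seen`, `start` and `end` are assumptions, and `str.find` stands in for `strpos()`.

```python
def find_repeating_sequences(s):
    """Return the substrings of s that occur again later without overlapping."""
    seen = {}  # dict used as an insertion-ordered set of known repeats
    length = len(s)
    for start in range(length):
        end = start + 1
        pat = s[start]
        while end < length:
            if pat not in seen:
                # Search only when this pattern is not already known to repeat.
                if s.find(pat, end) == -1:
                    break
                seen[pat] = True
            # Expand the pattern by one character and try again.
            pat += s[end]
            end += 1
    return list(seen)


print(find_repeating_sequences("abcbcdcde"))  # ['b', 'bc', 'c', 'cd', 'd']
```

The cache only skips searches for patterns already known to repeat, so the result is unchanged; whether it gives a similar speed-up in Python has not been measured here.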

answered Dec 14 '12 at 08:20

Ja͢ck

Do be warned that PHP has a recursion limit of about 100. Excessively interesting strings are gonna die in the recursive function `rep`. – Charles Dec 14 '12 at 08:38
@Charles And use the Y-combinator? :) – Ja͢ck Dec 14 '12 at 08:38
oh I wish. No, it then occurred to me that you can't have an anonymous function call itself without making itself part of the `use` clause and then it just begins getting *weird*. – Charles Dec 14 '12 at 08:38
Actually, now that I've rewritten it by taking out one of the recursive calls, I might as well make it iterative :) – Ja͢ck Dec 14 '12 at 08:42
Doing it via an iterative loop (i.e. brute force) like the one you posted would be easy for me to do. I was hoping there was some regex trick I could use to do this in one shot without having to recurse all the substrings. – Kris Craig Dec 14 '12 at 11:30
I have a brute force loop I've been using for testing the outputs and it can be done without having to rely on functional recursion (which, as you pointed out, is limited in PHP). However, the actual string inputs we're dealing with can be anywhere from 3 to tens of thousands of characters (can't remember the precise scope off the top of my head). Iterating through the whole string takes way too long to be usable in a production environment. – Kris Craig Dec 14 '12 at 11:34
@KrisCraig: Any regex solution will have to do the same brute-force iterative approach, too. You're just "outsourcing" that to the regex engine - the computational complexity is exactly the same. – Tim Pietzcker Dec 14 '12 at 11:43
Yes, but in PHP, that can make a world of difference. Anything you can delegate to the lower level will yield better performance compared to putting it in the scripting layer. PCRE is pretty well optimized for performance as it is, so I definitely wouldn't discount letting it handle the heavy lifting. – Kris Craig Dec 14 '12 at 11:51
@KrisCraig I agree, *if* it's actually possible with PCRE, but the problem definition is *not* regular. As Tim pointed out it will not return subsets of repeated elements, so you still have to calculate the permutations yourself. – Ja͢ck Dec 14 '12 at 12:16
Wow, I have to admit, those perf results for PCRE are quite startling! When I was at Microsoft, I was in charge of perf testing the official Windows PHP builds. Clearly, this is something I missed! I'll bring it up on Internals sometime and see if anyone knows why it's performing so badly. – Kris Craig Dec 22 '12 at 03:04
Oh and PHP is written in ANSI C, as is PCRE, so unless there's a bug in PHP's interface with PCRE, I doubt writing the code for this in C would make any meaningful difference. – Kris Craig Dec 22 '12 at 03:06
Jack, what version of PHP were you using and what was the environment it was running in? I think those numbers are bad enough that this should be looked into. – Kris Craig Dec 22 '12 at 03:08
@Charles: The recursion limit of 100 is only the default xdebug setting; PHP itself keeps going until it segfaults. See also: http://stackoverflow.com/questions/7327393/why-does-an-infinitely-recursive-function-in-php-cause-a-segfault – hakre Dec 22 '12 at 03:09
@KrisCraig It's PHP 5.3.10 with Suhosin-Patch running on Mac; I can try different configurations if you'd like. – Ja͢ck Dec 22 '12 at 03:09
@hakre This is not PHP recursion we're talking about though; I'm not even sure if recursion *is* the problem. – Ja͢ck Dec 22 '12 at 03:11
@KrisCraig You seem to have some serious misconceptions about how PCRE works. It is not a magic tool to speed stuff up. PCRE uses a backtracking algorithm, which has a worst-case *exponential* runtime. This is not an implementation issue but a complexity-theoretic one: matching regular expressions with backreferences is NP-hard. A regular expression like `(.+).*\1` is exactly the kind of expression that will cause the regex engine serious trouble. Generally everything of the form (.*).* is a really bad idea. – NikiC Dec 22 '12 at 13:00
@KrisCraig That's why the regex engine will backtrack itself to death even with rather small inputs. PCRE *is* usually faster than implementing stuff manually in PHP, but only as long as the regular expression is reasonably sane ;) – NikiC Dec 22 '12 at 13:01

score 2 · Answer 2 · answered Dec 14 '12 at 08:19

You can't get the required result in a single regex because a regex will match either greedily (finding bc...bc) or lazily (finding b...b and c...c), but never both. (In your case, it does find c...c, but only because c is repeated twice.)

But once you've found a repeated substring of length > 1, it logically follows that all the smaller "substrings of that substring" must also be repeated. If you want to get them spelled out for you, you need to do this separately.

Taking your example (using Python because I don't know PHP):

>>> import re
>>> results = set(m.group(1) for m in re.finditer(r"(?=(.+).*\1)", "abcbcdcde"))
>>> results
{'d', 'cd', 'bc', 'c'}

You could then go and apply the following function to each of your results:

def substrings(s):
    # every start index, including the last one, so the final character is not dropped
    return [s[start:stop] for start in range(len(s))
                          for stop in range(start+1, len(s)+1)]

For example:

>>> substrings("123456")
['1', '12', '123', '1234', '12345', '123456', '2', '23', '234', '2345', '23456',
 '3', '34', '345', '3456', '4', '45', '456', '5', '56', '6']
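
Putting the two steps together might look like the sketch below. It is only an illustration under the assumptions already made in this answer (non-overlapping repeats, and `.` not matching newlines); the function name `repeated_substrings` is made up for the example.

```python
import re


def repeated_substrings(text):
    """Collect every substring that occurs at least twice without overlapping."""
    found = set()
    # For each start position the lookahead captures the longest repeat found there.
    for match in re.finditer(r"(?=(.+).*\1)", text):
        longest = match.group(1)
        # Every substring of a repeated substring is repeated as well.
        for start in range(len(longest)):
            for stop in range(start + 1, len(longest) + 1):
                found.add(longest[start:stop])
    return found


print(sorted(repeated_substrings("abcbcdcde")))  # ['b', 'bc', 'c', 'cd', 'd']
```

The regex still does the expensive part, so this shows that the result set is correct and says nothing about speed.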

I was afraid somebody would say that. It makes perfect sense that regex wouldn't support something like this. I guess I was hoping it could be done and that I just wasn't figuring it out. That said, the regex you posted might cut down the iterations enough for this to be usable in production (see my comment on the other answer). I'll try it tomorrow and post a follow-up when I have the results. =) — Kris Craig, Dec 14 '12 at 11:37
@KrisCraig Turns out it doesn't cut down enough :) see my update. — Ja͢ck, Dec 14 '12 at 16:02

score 1 · Answer 3 · answered Dec 14 '12 at 07:57

The closest I can get is /(?=(.+).*\1)/

The purpose of the lookahead is to allow the same characters to be matched more than once (for instance, c and cd). However, for some reason it doesn't seem to be getting the b...
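
To see where the b goes, it may help to list what the pattern captures at each start position. The snippet below uses Python's `re` module rather than PCRE, on the assumption that both engines treat this pattern the same way for such a short input:

```python
import re

text = "abcbcdcde"
# The zero-width lookahead lets the engine try every start position,
# but each position still yields only one capture: the longest repeat there.
captures = [(m.start(), m.group(1)) for m in re.finditer(r"(?=(.+).*\1)", text)]
print(captures)  # [(1, 'bc'), (2, 'c'), (4, 'cd'), (5, 'd')]
```

At position 1 the greedy group reports bc rather than b, and at position 3 there is no later b left to match, so b on its own never shows up.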

Niet the Dark Absol

It doesn't get the `b` because it's already getting the `bc`. A single regex is not going to be able to do this. – Tim Pietzcker Dec 14 '12 at 07:58
That was what the lookahead was intended for. Is it still advancing too much? Oh, I think I see it... yes. – Niet the Dark Absol Dec 14 '12 at 08:00
@Kolink I actually tried that one earlier. Didn't work. GMTA lol. ;) – Kris Craig Dec 14 '12 at 11:48
Somehow this expression breaks down after 1000 chars when used in pcre :) – Ja͢ck Dec 15 '12 at 00:58

score 1 · Answer 4 · edited May 23 '17 at 12:05

Interesting question. I basically took the function in Jack's answer and tried to see whether the number of tests could be reduced.

I first tried searching only half the string, but building the search pattern with substr on every iteration turned out to be far too expensive. Appending one character per iteration, as Jack's answer does, appears to be much better. Then I ran out of time, so I could not look into it further.

However, while looking for an alternative implementation, I at least found that some of the differences in the algorithm I had in mind could be applied to Jack's function as well:

There is no need to cut off the beginning of the string in each outer iteration, as the search is already done with offsets.
If the remainder of the subject is shorter than the needle, there is no need to search for the needle.
If the needle has already been searched for, there is no need to search again.

Note: This is a memory trade-off. If there are many repetitions, memory use is similar. If there are few repetitions, this variant uses more memory than before.

The function:

function find_repeating_sequences($string) {
    $result = array();
    $start  = 0;
    $max    = strlen($string);
    while ($start < $max) {
        $pat = $string[$start];
        $i   = ++$start;
        while ($max - $i > 0) {
            $found = isset($result[$pat]) ? $result[$pat] : false !== strpos($string, $pat, $i);
            if (!$result[$pat] = $found) break;
            // expand pattern and try again
            $pat .= $string[$i++];
        }
    }
    return array_keys(array_filter($result));
}

So just see this as an addition to Jack's answer.
