
This regular expression matches palindromes: ^((.)(?1)\2|.?)$

I can't wrap my head around how it works. When does the recursion end, and when does the regex break out of the recursive subpattern and move on to the "|.?" part?

Thanks.

Edit: sorry, I didn't explain \2 and (?1):

(?1) - a recursive reference to the first subpattern (that is, to itself)

\2 - a back-reference to the match of the second subpattern, which is (.)

The example above is written for PHP. It matches both "abba" (no middle character) and "abcba" (which has a middle, non-reflected character).

alexy2k

3 Answers


^((.)(?1)\2|.?)$

The ^ and $ assert the beginning and the end of the string, respectively. Let us look at the content in between, which is more interesting:

((.)(?1)\2|.?)
1------------1 // Capturing group 1
 2-2           // Capturing group 2

Look at the first part, (.)(?1)\2: it tries to match any character, and then that same character again at the end (the back-reference \2 refers to whatever (.) matched). In between, it recursively matches the whole of capturing group 1. Because (.) consumes one character at the beginning and \2 must match the same character at the end, there is an implicit requirement that the string be at least 2 characters long. The purpose of the first part is to chop the identical ends off the string, recursively.

Look at the second part, .?: it matches one character or none. This alternative only leads to an overall match if the string has length 0 or 1 to begin with, or if what is left over after the recursive chopping is 0 or 1 character long. The purpose of the second part is to match the empty string, or the single lonely character, that remains once the string has been chopped from both ends.

The recursive matching works as follows:

  • The whole string must be a palindrome to pass, as asserted by ^ and $. We cannot start matching from an arbitrary position.
  • If the string is <= 1 character long, it passes.
  • If the string is >= 2 characters long, whether it is accepted is decided by the first part only. If the first part matches, one character is chopped off each end.
  • The leftover, if any, can in turn only be chopped at both ends, or it passes once its length is <= 1.
nhahtdh
  • Thanks for the reply. I've edited the original post with more explanation. `\2` is not length. – alexy2k Jul 26 '12 at 15:59
  • `\2` is not length, but it does force that expression to be at least two characters long, because a single character can't match both `(.)` and the backreference `\2`. – Sam Mussmann Jul 26 '12 at 16:06
  • Each time it recurses because of the first part, it's processing the string with the two end characters removed. Eventually it shrinks to 0 or 1 characters, then the second part matches and the recursion stops. – Barmar Jul 26 '12 at 16:19
  • I do admit it is a bit confusing. I have edited my post, but I'm not sure if it is any clearer. – nhahtdh Jul 26 '12 at 16:31
  • Answer accepted. Thank you! I admit I still have trouble comprehending it; perhaps I need to learn automata. One more question: when a function calls itself, it goes "back" once it reaches the "bottom" state (say, 1 when calculating a factorial) and multiplies that by the results of the previous calls. But when is the final state reached here? At what point does the regex engine decide "now I go and match `\2` in reverse order, one by one", if that makes sense? – alexy2k Jul 26 '12 at 18:25
  • @alexy2k: I don't know how the regex engine works internally. I imagine that it naively recurses into `(?1)` first and only then matches `\2` (since in a general regex, you can't know whether there are more recursive groups ahead). This is mostly guesswork, so don't take this comment too seriously. – nhahtdh Jul 26 '12 at 18:56

The regex is essentially equivalent to the following pseudo-code:

palin(str) {
    if (length(str) >= 2) {
        // compare the two ends, then recurse on what lies between them
        first = str[0];
        last = str[length(str)-1];
        return first == last && palin(substr(str, 1, length(str)-2));
    } else {
        // empty and single-character strings are trivially palindromes
        return true;
    }
}
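
For anyone who wants to run it, here is a minimal Python translation of the pseudo-code above. It is only a sketch of the same idea, not what the regex engine does internally, and the function name `palin` is simply carried over from the pseudo-code.

```python
def palin(s: str) -> bool:
    """Return True if s reads the same forwards and backwards."""
    if len(s) >= 2:
        # The two ends must agree, and the part between them
        # must itself be a palindrome.
        return s[0] == s[-1] and palin(s[1:-1])
    # Empty and single-character strings are trivially palindromes.
    return True


if __name__ == "__main__":
    for text in ["", "a", "abba", "abcba", "ab", "abca"]:
        print(repr(text), palin(text))
```

The slice `s[1:-1]` plays the role of `substr(str, 1, length(str)-2)`: it drops one character from each end.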
Barmar

I haven't found any nice debugging utility for PCRE regexps. The most I could find was how to dump the bytecode:

$ pcretest -b
PCRE version 7.6 2008-01-28

  re> /^((.)(?1)\2|.?)$/x
------------------------------------------------------------------
  0  39 Bra
  3     ^
  4  26 CBra 1
  9   6 CBra 2
 14     Any
 15   6 Ket
 18   6 Once
 21   4 Recurse
 24   6 Ket
 27     \2
 30   5 Alt
 33     Any?
 35  31 Ket
 38     $
 39  39 Ket
 42     End
------------------------------------------------------------------

Perl has better debugging tools than PCRE. Try echo 12321 | perl -Mre=debug -ne '/^((.)(?1)\2|.?)$/x'. This prints not only bytecode similar to PCRE's, but also each matching step, along with the consumed and remaining parts of the input at that step:

Compiling REx "^((.)(?1)\2|.?)$"
Final program:
   1: BOL (2)
   2: OPEN1 (4)
   4:   BRANCH (15)
   5:     OPEN2 (7)
   7:       REG_ANY (8)
   8:     CLOSE2 (10)
  10:     GOSUB1[-8] (13)
  13:     REF2 (19)
  15:   BRANCH (FAIL)
  16:     CURLY {0,1} (19)
  18:       REG_ANY (0)
  19: CLOSE1 (21)
  21: EOL (22)
  22: END (0)
floating ""$ at 0..2147483647 (checking floating) anchored(BOL) minlen 0 
Guessing start of match in sv for REx "^((.)(?1)\2|.?)$" against "12321"
Found floating substr ""$ at offset 5...
Guessed: match at offset 0
Matching REx "^((.)(?1)\2|.?)$" against "12321"
   0 <> <12321>              |  1:BOL(2)
   0 <> <12321>              |  2:OPEN1(4)
   0 <> <12321>              |  4:BRANCH(15)
   0 <> <12321>              |  5:  OPEN2(7)
   0 <> <12321>              |  7:  REG_ANY(8)
   1 <1> <2321>              |  8:  CLOSE2(10)
   1 <1> <2321>              | 10:  GOSUB1[-8](13)
   1 <1> <2321>              |  2:    OPEN1(4)
   1 <1> <2321>              |  4:    BRANCH(15)
   1 <1> <2321>              |  5:      OPEN2(7)
   1 <1> <2321>              |  7:      REG_ANY(8)
   2 <12> <321>              |  8:      CLOSE2(10)
   2 <12> <321>              | 10:      GOSUB1[-8](13)
   2 <12> <321>              |  2:        OPEN1(4)
   2 <12> <321>              |  4:        BRANCH(15)
   2 <12> <321>              |  5:          OPEN2(7)
   2 <12> <321>              |  7:          REG_ANY(8)
   3 <123> <21>              |  8:          CLOSE2(10)
   3 <123> <21>              | 10:          GOSUB1[-8](13)
   3 <123> <21>              |  2:            OPEN1(4)
   3 <123> <21>              |  4:            BRANCH(15)
   3 <123> <21>              |  5:              OPEN2(7)
   3 <123> <21>              |  7:              REG_ANY(8)
   4 <1232> <1>              |  8:              CLOSE2(10)
   4 <1232> <1>              | 10:              GOSUB1[-8](13)
   4 <1232> <1>              |  2:                OPEN1(4)
   4 <1232> <1>              |  4:                BRANCH(15)
   4 <1232> <1>              |  5:                  OPEN2(7)
   4 <1232> <1>              |  7:                  REG_ANY(8)
   5 <12321> <>              |  8:                  CLOSE2(10)
   5 <12321> <>              | 10:                  GOSUB1[-8](13)
   5 <12321> <>              |  2:                    OPEN1(4)
   5 <12321> <>              |  4:                    BRANCH(15)
   5 <12321> <>              |  5:                      OPEN2(7)
   5 <12321> <>              |  7:                      REG_ANY(8)
                                                        failed...
   5 <12321> <>              | 15:                    BRANCH(19)
   5 <12321> <>              | 16:                      CURLY {0,1}(19)
                                                        REG_ANY can match 0 times out of 1...
   5 <12321> <>              | 19:                        CLOSE1(21)
                                                          EVAL trying tail ... 9d86dd8
   5 <12321> <>              | 13:                          REF2(19)
                                                            failed...
                                                        failed...
                                                      BRANCH failed...
   4 <1232> <1>              | 15:                BRANCH(19)
   4 <1232> <1>              | 16:                  CURLY {0,1}(19)
                                                    REG_ANY can match 1 times out of 1...
   5 <12321> <>              | 19:                    CLOSE1(21)
                                                      EVAL trying tail ... 9d86d70
   5 <12321> <>              | 13:                      REF2(19)
                                                        failed...
   4 <1232> <1>              | 19:                    CLOSE1(21)
                                                      EVAL trying tail ... 9d86d70
   4 <1232> <1>              | 13:                      REF2(19)
                                                        failed...
                                                    failed...
                                                  BRANCH failed...
   3 <123> <21>              | 15:            BRANCH(19)
   3 <123> <21>              | 16:              CURLY {0,1}(19)
                                                REG_ANY can match 1 times out of 1...
   4 <1232> <1>              | 19:                CLOSE1(21)
                                                  EVAL trying tail ... 9d86d08
   4 <1232> <1>              | 13:                  REF2(19)
                                                    failed...
   3 <123> <21>              | 19:                CLOSE1(21)
                                                  EVAL trying tail ... 9d86d08
   3 <123> <21>              | 13:                  REF2(19)
                                                    failed...
                                                failed...
                                              BRANCH failed...
   2 <12> <321>              | 15:        BRANCH(19)
   2 <12> <321>              | 16:          CURLY {0,1}(19)
                                            REG_ANY can match 1 times out of 1...
   3 <123> <21>              | 19:            CLOSE1(21)
                                              EVAL trying tail ... 9d86ca0
   3 <123> <21>              | 13:              REF2(19)
   4 <1232> <1>              | 19:              CLOSE1(21)
                                                EVAL trying tail ... 0
   4 <1232> <1>              | 13:                REF2(19)
   5 <12321> <>              | 19:                CLOSE1(21)
   5 <12321> <>              | 21:                EOL(22)
   5 <12321> <>              | 22:                END(0)
Match successful!
Freeing REx: "^((.)(?1)\2|.?)$"

As you can see, Perl first recurses until it has consumed all the input and (.) fails. It then starts backtracking: at each level it tries the second branch of the alternation, .?, followed by the remainder of the first part at the level above, \2. Whenever that fails, it backtracks one level further, until it finally succeeds.
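
The same order of attempts can be imitated by hand. Python's standard `re` module has no `(?1)` recursion, so the sketch below emulates group 1 with a generator that yields every position where the group could end, in the order a backtracking engine would try them. It is a rough model built to mirror the trace above, not Perl's actual implementation; the names `group1` and `matches` are made up for illustration, and newline handling of `.` and `$` is ignored.

```python
from typing import Iterator


def group1(s: str, pos: int) -> Iterator[int]:
    """Yield each end position at which group 1 can match from pos."""
    # First alternative: (.)(?1)\2
    if pos < len(s):
        ch = s[pos]                       # (.) captures one character
        for mid in group1(s, pos + 1):    # (?1): recurse; each value is one
            # way the inner call can end, so asking for the next value
            # corresponds to backtracking into the recursion
            if mid < len(s) and s[mid] == ch:   # \2 must repeat the capture
                yield mid + 1
    # Second alternative: .? (greedy, so one character is tried before zero)
    if pos < len(s):
        yield pos + 1
    yield pos


def matches(s: str) -> bool:
    """Emulate ^((.)(?1)\\2|.?)$ : group 1 must end exactly at the end of s."""
    return any(end == len(s) for end in group1(s, 0))


if __name__ == "__main__":
    for text in ["12321", "abba", "abcba", "ab", ""]:
        print(repr(text), matches(text))
```

Each recursive call first dives as deep as it can, which is the run of `GOSUB1` lines in the trace. Only when a deeper call offers an end position does the caller get to test its own `\2`, and if that test fails it asks the deeper call for its next alternative. Note that this sketch can take exponential time on inputs with long runs of identical characters.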

ninjalj
  • Why are there six `OPEN1` but eight `CLOSE1` in this debug output? Shouldn't there be six `CLOSE1` as well? – revo Jun 27 '16 at 21:59