7

I recently came across a puzzle to find a regular expression that matches:

5-character-long strings comprised of lowercase English letters in ascending ASCII order

Valid examples include:

aaaaa
abcde
xxyyz
ghost
chips
demos

Invalid examples include:

abCde
xxyyzz
hgost
chps

My current solution is kludgy. I use the regex:

(?=^[a-z]{5}$)^(a*b*c*d*e*f*g*h*i*j*k*l*m*n*o*p*q*r*s*t*u*v*w*x*y*z*)$

which uses a positive lookahead (a non-consuming group) to assert a string length of 5, and then verifies that the string consists of lowercase English letters in order (see Rubular).
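
For anyone who wants to reproduce this outside Rubular, here is a minimal sketch in Python (an assumption on my part; the question itself only references Ruby via Rubular). The helper name `is_ascending` is made up for illustration, and the pattern is the one above, just assembled programmatically.

```python
import re
from string import ascii_lowercase

# Build the a*b*c*...z* chain instead of typing it out by hand.
ORDERED = "".join(f"{c}*" for c in ascii_lowercase)

# Same shape as the pattern above: the lookahead fixes the length at 5,
# then the chain only admits letters in non-decreasing order.
PATTERN = re.compile(rf"(?=^[a-z]{{5}}$)^({ORDERED})$")


def is_ascending(s: str) -> bool:
    """Hypothetical helper: True if s matches the puzzle's pattern."""
    return PATTERN.match(s) is not None


valid = ["aaaaa", "abcde", "xxyyz", "ghost", "chips", "demos"]
invalid = ["abCde", "xxyyzz", "hgost", "chps"]

for word in valid + invalid:
    print(word, is_ascending(word))
```

The lookahead and the chain each do half of the job: drop the lookahead and `xxyyzz` is accepted, drop the chain and `hgost` is accepted.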

Instead, I'd like to use back references inside character classes. Something like:

^([a-z])([\1-z])([\2-z])([\3-z])([\4-z])$

The logic for the solution (see Rubular) in my head is to capture the first character [a-z], use it as a backreference in the second character class and so on. However, \1, \2 ... within character classes seem to refer to ASCII values of 1, 2... effectively matching any four- or five-character string.

I have 2 questions:

  1. Can I use back references in my character classes to check for ascending order strings?
  2. Is there any less-hacky solution to this puzzle?
Jedi
  • 3
    As far as I can tell, your kludgy solution is as sweet as it gets because the character classes don't play nice with your backreferences (but I do like your logic, seems like a nice feature to have). Do you have a specific environment to run this regex in (Ruby only or agnostic)? I am sure the regex wizards of SO will be along shortly to add their expertise. – mickmackusa Jun 30 '17 at 22:35
  • 2
    Uh, no backreferences inside character classes. The reason is that character classes are composed at _compile time_. The determining factor is that a backreference is dynamic, and what disqualifies it inside a class is the range operator. So, they say.. _No dynamic range_ lest the engine crashes in a C++ exception. –  Jul 13 '17 at 18:55
  • 1
    `came across a puzzle to find a regular expression` - Be sure to _cite_ that link so we may laugh at the puzzler.. –  Jul 13 '17 at 18:58
  • Actually, it's quite easy in a Perl regex. You can do all sorts of stuff, like counting, sequences, bools, subtraction, etc... `(?{..})`. If you think you'd use Perl, then it's .. doable. Btw, your first regex is just fine. –  Jul 13 '17 at 18:59
  • One other thing to note is that in character classes the _range_ operator pertains to single characters (min,max) not a set of characters, like `\pL-z` throws a construction error. In that vein, a reference can contain multiple characters. –  Jul 13 '17 at 19:39
  • Why is `demos` included in invalid list? – anubhava Jul 13 '17 at 22:05
  • 1
    ok, your regex looks pretty nice to me. You can slightly improve it by using: `^(?=[a-z]{5}$)a*b*c*d*e*f*g*h*i*j*k*l*m*n*o*p*q*r*s*t*u*v*w*x*y*z*$` – anubhava Jul 13 '17 at 22:09
  • 1
    duplicate of [https://stackoverflow.com/questions/3171671/regex-5-digits-in-increasing-order](https://stackoverflow.com/questions/3171671/regex-5-digits-in-increasing-order) – A1m Jul 14 '17 at 00:45
  • Question cannot be closed due to bounty, but that is a dead-on dupe link. – mickmackusa Jul 14 '17 at 09:54

4 Answers

4

I'm posting this answer more as a comment than an answer since it has better formatting than comments.

Related to your questions:

  1. Can I use back references in my character classes to check for ascending order strings?

No, you can't. If you take a look at the regular-expressions backreference section, you will find the following documentation:

Parentheses and Backreferences Cannot Be Used Inside Character Classes

Parentheses cannot be used inside character classes, at least not as metacharacters. When you put a parenthesis in a character class, it is treated as a literal character. So the regex [(a)b] matches a, b, (, and ).

Backreferences, too, cannot be used inside a character class. The \1 in a regex like (a)[\1b] is either an error or a needlessly escaped literal 1. In JavaScript it's an octal escape.
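
To see what "needlessly escaped literal" or "octal escape" means in practice, here is a small sketch in Python (my choice of engine for illustration, not the one from the question). In Python's `re` module, numeric escapes inside a character class are treated as characters, so `[\1-z]` is simply the range from the character with code 1 up to `z`. The attempted pattern therefore compiles, but it no longer checks any ordering.

```python
import re

# The attempted pattern from the question, unchanged.
attempt = re.compile(r"^([a-z])([\1-z])([\2-z])([\3-z])([\4-z])$")

# Inside [...] the \1 is an octal escape for chr(1), not a backreference,
# so the class is the plain range chr(1)..'z'.
single_class = re.compile(r"[\1-z]")

for word in ["abcde", "hgost", "abCde", "chps", "xxyyzz"]:
    print(word, attempt.match(word) is not None)

print(single_class.fullmatch("\x01") is not None)  # chr(1) is inside the range
print(single_class.fullmatch("{") is not None)     # '{' comes right after 'z'
```

Other engines handle the same pattern differently (an error, or a literal `1`), which is exactly the point of the quoted passage: the behavior is engine-specific and never a backreference.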

Regarding your 2nd question:

  2. Is there any less-hacky solution to this puzzle?

IMHO, your regex is perfectly fine; you could shorten it slightly at the beginning like this:

(?=^.{5}$)^a*b*c*d*e*f*g*h*i*j*k*l*m*n*o*p*q*r*s*t*u*v*w*x*y*z*$
    ^--- Here

Regex demo

Federico Piazza
  • 1
    I think it's worth saying *why* backreferences in character classes are not possible. A backreference can point to a matched substring, which could be any length, even empty. Thus, code like `[\1-z]` could yield `[-z]`, `[f-z]`, `[foo-z]` etc. That doesn't make much sense and the pattern would have to be recompiled on the fly, preventing optimizations. – Lucas Trzesniewski Jul 15 '17 at 12:26
  • 1
    It's disturbing to see docs quoted when they don't provide reasons why something can't be used in such a fundamental way. References inside character classes are such an example. But, you could always use your own reasoning for such a case, and it's fairly easy to know why in this case. –  Jul 15 '17 at 21:10
3

If you are willing to use Perl (!), this will work:

/^([a-z])((??{"[$1-z]"}))((??{"[$2-z]"}))((??{"[$3-z]"}))(??{"[$4-z]"})$/
NetMage
  • 1
    But... that's cheating! :) If you use Perl then I guess you could just code the assertion directly: `/^([a-z]{5})$(?(?{$1 eq (join "", sort split qr##, $1)})|(?!))/` – Lucas Trzesniewski Jul 15 '17 at 12:21
  • At some point I think it is questionable as to whether you are doing a Regex solution. I think mine is closer on the continuum :) – NetMage Jul 17 '17 at 19:03
2

Since someone has broken the ice by using Perl, here is a Perl solution, I guess.

Note that this is a basic non-regex solution that just happens to be stuffed into code constructs inside a Perl regex. The interesting thing is that if a day comes when you need the synergy of regex and code, this is a good choice. Instead of a simple [a-z] character, you could then use a very complex pattern in its place and still check it against the last match. That is power!


The regex ^(?:([a-z])(?(?{ $last gt $1 })(?!)|(?{ $last = $1 }))){5}$

Perl code

use strict;
use warnings;


$/ = "";

my @DAry = split /\s+/, <DATA>;

my $last;

for (@DAry)
{
    $last = '';
    if ( 
      /
         ^                             # BOS
         (?:                           # Cluster begin
              ( [a-z] )                     # (1), Single a-z letter
                                            # Code conditional
              (?(?{
                   $last gt $1                  # last > current ?
                })
                   (?!)                          # Fail
                |                              # else,
                   (?{ $last = $1 })             # Assign last = current
              )
         ){5}                          # Cluster end, do 5 times
         $                             # EOS
      /x )
    {
        print "good   $_\n";
    }
    else {
        print "bad    $_\n";
    }
}

__DATA__

aaaaa
abcde
xxyyz
ghost
chips
demos
abCde
xxyyzz
hgost
chps

Output

good   aaaaa
good   abcde
good   xxyyz
good   ghost
good   chips
good   demos
bad    abCde
bad    xxyyzz
bad    hgost
bad    chps
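
For comparison, the same compare-with-last idea can be sketched without any regex at all. This is a Python illustration of the logic only, not a translation of the Perl internals; the function name `ascending_five` is made up.

```python
def ascending_five(s: str) -> bool:
    """Hypothetical helper mirroring the regex: exactly 5 lowercase
    letters, each one not smaller than the one before it."""
    if len(s) != 5:
        return False
    last = ""  # the empty string sorts before every letter
    for ch in s:
        # Reject anything outside a-z, or a letter smaller than the last one.
        if not ("a" <= ch <= "z") or last > ch:
            return False
        last = ch
    return True


words = ["aaaaa", "abcde", "xxyyz", "ghost", "chips", "demos",
         "abCde", "xxyyzz", "hgost", "chps"]

for w in words:
    print("good  " if ascending_five(w) else "bad   ", w)
```

The regex version earns its keep only when the per-step check is itself a pattern; for single letters, the plain loop does the same work.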
2

Ah, well, it's a finite set, so you can always enumerate it with alternation! This C program emits a "brute force" kind of regex, wrapped in a little Perl read-and-match loop:

#include <stdio.h>

int main(void) {
  printf("while (<>) { if (/^(?:");
  for (int a = 'a'; a <= 'z'; ++a)
    for (int b = a; b <= 'z'; ++b)
      for (int c = b; c <= 'z'; ++c) {
        for (int d = c; d <= 'y'; ++d)
          printf("%c%c%c%c[%c-z]|", a, b, c, d, d);
        printf("%c%c%czz", a, b, c);
        if (a != 'z' || b != 'z' || c != 'z') printf("|\n");
      }
  printf(")$/x) { print \"Match!\\n\" } else { print \"No match.\\n\" }}\n");
  return 0;
}

And now:

$ gcc r.c
$ ./a.out > foo.pl
$ cat > data.txt
aaaaa
abcde
xxyyz
ghost
chips
demos
abCde
xxyyzz
hgost
chps
^D
$ perl foo.pl < data.txt
Match!
Match!
Match!
Match!
Match!
Match!
No match.
No match.
No match.
No match.

The regex is only 220 KB or so ;-)
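
If you'd rather not go through C and Perl, the same enumeration can be sketched in Python (my choice for illustration; the names `brute_force_alternatives` and `BRUTE` are made up). It builds the same alternatives as the C loops above: one `abcd[d-z]` branch per non-decreasing 4-letter prefix with `d` up to `y`, plus one `abczz` branch per 3-letter prefix. Compiling a pattern this large may take a moment.

```python
import math
import re
from string import ascii_lowercase as LETTERS


def brute_force_alternatives() -> list[str]:
    """Hypothetical helper: the same branches the C program prints."""
    alts = []
    for i, a in enumerate(LETTERS):
        for j in range(i, 26):
            for k in range(j, 26):
                b, c = LETTERS[j], LETTERS[k]
                for d in LETTERS[k:25]:  # d runs from c up to 'y'
                    alts.append(f"{a}{b}{c}{d}[{d}-z]")
                alts.append(f"{a}{b}{c}zz")  # the d == 'z' case
    return alts


ALTS = brute_force_alternatives()
BRUTE = re.compile("^(?:" + "|".join(ALTS) + ")$")

# Branch count: non-decreasing 4-letter prefixes over 26 letters.
print(len(ALTS), math.comb(29, 4))
# Number of strings the pattern accepts: non-decreasing 5-letter strings.
print(math.comb(30, 5))

for word in ["aaaaa", "ghost", "demos", "abCde", "xxyyzz", "hgost", "chps"]:
    print(word, BRUTE.match(word) is not None)
```

Ending each branch in a class like `[d-z]` is what keeps this to a few tens of thousands of branches rather than one per accepted string.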

Gene
  • Aargh. Your regex broke Rubular (can't make a permalink). :-) Yep, it's [219 KB raw](https://pastebin.com/raw/CvANkCM8) and works! – Jedi Jul 20 '17 at 04:55
  • Thanks for this. Using the raw regex text (from @Jedi paste bin) I put it through this [ternary tool](http://www.regexformat.com/version_files/Rx5_ScrnSht01.jpg) and made a full blown regex trie. That reduced the size quite a bit (160k [compressed](http://www.regexformat.com/Dnl/_Samples/_Ternary_Tool%20(Dictionary)/___txt/__Misc/Series_a-z_Ascending_5_letters_Compressed.txt), 840k [formatted](http://www.regexformat.com/Dnl/_Samples/_Ternary_Tool%20(Dictionary)/___txt/__Misc/Series_a-z_Ascending_5_letters_Formatted.txt)). Here is a [Python test](http://rextester.com/CECLB89681). –  Jul 22 '17 at 22:45
  • 1
    Also, I generated the full 142,506 possible strings, then used the full trie regex I made to test the performance: `Regex1: Options: < none > Completed iterations: 1 / 1 ( x 1 ) Matches found per iteration: 142506 Elapsed Time: 0.18 s, 178.87 ms, 178871 µs` –  Jul 22 '17 at 22:57