
The Javadoc for java.util.regex.Pattern says that \cx represents "The control character corresponding to x". So I expected Pattern.compile() to reject \c followed by any character outside [@-_], but it doesn't!

As @tchrist commented on one of the answers to "What is a regular expression for control characters?", the range is not checked at all. I tested a couple of characters from higher blocks and also from the astral planes, and it looks like it merely flips the 7th lowest bit (0x40) of the code point value.
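A minimal sketch for reproducing that observation. The class and method names (ControlEscapeDemo, flip, matchesFlipped) are made up for illustration, and the sample characters are arbitrary picks. It only checks whether \c followed by x matches the character whose code point is x XOR 0x40; results may differ between JDK versions.

```java
import java.util.regex.Pattern;

class ControlEscapeDemo {
    // The flip described above: XOR with 0x40 toggles the 7th lowest bit.
    static int flip(int codePoint) {
        return codePoint ^ 0x40;
    }

    // Compiles \c<x> and reports whether it matches the single character
    // whose code point is flip(x).
    static boolean matchesFlipped(int x) {
        String escape = "\\c" + new String(Character.toChars(x));
        String candidate = new String(Character.toChars(flip(x)));
        return Pattern.compile(escape).matcher(candidate).matches();
    }

    public static void main(String[] args) {
        // 'A' is inside the documented range; '1', U+00E9 and U+1F600 are not.
        int[] samples = {'A', '1', 0x00E9, 0x1F600};
        for (int x : samples) {
            System.out.printf("\\c + U+%04X -> U+%04X, matches: %b%n",
                    x, flip(x), matchesFlipped(x));
        }
    }
}
```

If every line prints "matches: true", the implementation applies the XOR without any range check, which is what I saw.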

So is it a Javadoc bug or an implementation bug or am I misunderstanding something? Is \cx a Java-invented syntax or is it supported by other regex engines, especially Perl? How is it handled there?

Community

1 Answer


All versions of Perl behave the same for the following escapes:

  • When \c is followed by an ASCII uppercase letter or one of @[\]^_?,

    chr(ord($char) ^ 0x40)

    This provides full coverage of all ASCII control characters (0x00..0x1F, 0x7F).

    \c@ === \x00
    \cA === \x01
    ...
    \cZ === \x1A
    \c[ === \x1B
    \c\ === \x1C   # Sometimes \c\\ is needed.
    \c] === \x1D
    \c^ === \x1E
    \c_ === \x1F
    \c? === \x7F
    
  • When \c is followed by an ASCII lowercase letter,

    chr(ord($char) ^ 0x60)

    This makes the escape case-insensitive.

    \ca === \cA === \x01
    ...
    \cz === \cZ === \x1A
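The two rules above can be sketched in Java (to match the question's language) as follows. This is only a model of the mapping as listed here, not Perl's actual implementation; the class name PerlControlEscape, the method decode and the -1 sentinel are assumptions made for the example.

```java
class PerlControlEscape {
    // Returns the code point that \c<c> denotes under the two rules above,
    // or -1 (a sentinel chosen for this sketch) for any other character.
    static int decode(char c) {
        // Uppercase ASCII letters and @ [ \ ] ^ _ ? : XOR with 0x40.
        if ((c >= 'A' && c <= 'Z') || "@[\\]^_?".indexOf(c) >= 0) {
            return c ^ 0x40;
        }
        // Lowercase ASCII letters: XOR with 0x60, so \ca equals \cA.
        if (c >= 'a' && c <= 'z') {
            return c ^ 0x60;
        }
        return -1;
    }

    public static void main(String[] args) {
        for (char c : new char[] {'@', 'A', 'a', 'Z', 'z', '[', '?', '1'}) {
            System.out.printf("\\c%c -> %d%n", c, decode(c));
        }
    }
}
```

Note that XOR with 0x40 clears bit 6 for 0x40..0x5F but sets it for '?' (0x3F), which is how \c? reaches 0x7F.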
    

No other sequence makes sense, but error checking was only introduced in Perl 5.20.

  • ≥5.20,

    • When \c is followed by a space, an ASCII digit or one of !"#$%&'()*+,-./:;<=>{|}~,

      chr(ord($char) ^ 0x40), but warns ("\c%c" is more clearly written simply as "%s").

    • When \c is followed by an ASCII control character (0x00..0x1F, 0x7F) or a non-ASCII character (≥0x80),

      Fatal error Character following "\c" must be printable ASCII.

  • <5.20,

    • When \c is followed by a space, an ASCII digit, one of !"#$%&'()*+,-./:;<=>{|}~ or an ASCII control character (0x00..0x1F, 0x7F),

      chr(ord($char) ^ 0x40)

    • When \c is followed by a character ≥0x100,

      Total garbage (chr(ord(substr(encode_utf8($char), 0, 1)) ^ 0x40) . substr(encode_utf8($char), 1)).

    • When \c is followed by character 0x80..0xFF,

      Depending on the internal storage format of the string, produces either chr(ord($char) ^ 0x40) or the same total garbage as for characters ≥0x100.
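As a rough illustration of the version-specific cases, here is a Java model of the ≥5.20 checks and of the pre-5.20 "garbage" for characters ≥0x100. It is a sketch of the rules as listed above, not Perl's code; PerlControlEscapeCheck, Outcome, classify and pre520Bytes are names invented for the example. The backtick is not mentioned in the lists above, so lumping it in with the warning case is an assumption.

```java
import java.nio.charset.StandardCharsets;

class PerlControlEscapeCheck {
    enum Outcome { OK, WARNING, FATAL }

    // Models the Perl >= 5.20 handling of the character following \c.
    static Outcome classify(int codePoint) {
        // ASCII control characters (0x00..0x1F, 0x7F) and non-ASCII: fatal.
        if (codePoint < 0x20 || codePoint >= 0x7F) {
            return Outcome.FATAL;
        }
        char c = (char) codePoint;
        boolean letter = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
        if (letter || "@[\\]^_?".indexOf(c) >= 0) {
            return Outcome.OK;
        }
        // Space, digits and the remaining punctuation: accepted with a warning.
        // Assumption: the backtick, which the lists above omit, also lands here.
        return Outcome.WARNING;
    }

    // Models the pre-5.20 result for characters >= 0x100: the first byte of
    // the UTF-8 encoding is XORed with 0x40, the remaining bytes are kept.
    static byte[] pre520Bytes(int codePoint) {
        byte[] utf8 = new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_8);
        utf8[0] ^= 0x40;
        return utf8;
    }

    public static void main(String[] args) {
        System.out.println(classify('A'));
        System.out.println(classify('1'));
        System.out.println(classify(0x263A));
        for (byte b : pre520Bytes(0x263A)) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```

For U+263A, whose UTF-8 encoding is E2 98 BA, the pre-5.20 model yields the bytes A2 98 BA, which no longer form a valid UTF-8 sequence.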

ikegami