Regular expression for removing white space and comments from free-spacing and comments mode regular expressions

Question

I'm writing a PCRE regular expression for the purpose of 'minifying' other PCRE regular expressions written in free-spacing and comments mode (/x flag), such as:

# Match a 20th or 21st century date in yyyy-mm-dd format
(19|20)\d\d                # year (group 1)
[- /.]                     # separator - dash, space, slash or period
(0[1-9]|1[012])            # month (group 2)
[- /.]                     # separator - dash, space, slash or period
(0[1-9]|[12][0-9]|3[01])   # day (group 3)

Note: I've intentionally omitted any regular expression delimiters and the x flag.

The result of 'minifying' the above expression should be that all literal whitespace characters (including new lines) and comments are removed, except literal spaces within a character class (e.g. [- /.]) and escaped whitespace characters (e.g. a backslash followed by a space):

(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])

This is the regular expression I have so far, itself written in free-spacing and comments mode (https://regex101.com/r/RHnyWw/2/):

(?<!\\)\s          # Match any non-escaped whitespace character
|
(?<!\\)\#.*\s*$    # Match comments (any text following non-escaped #)

Assuming I substitute all matches with the empty string, the result is:

(19|20)\d\d[-/.](0[1-9]|1[012])[-/.](0[1-9]|[12][0-9]|3[01])

This is close, except that the separator parts of the pattern, [- /.], have lost their literal space.
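
The loss of the space can be reproduced outside regex101. The sketch below uses Python's `re` module rather than PCRE (an assumption made for illustration; this particular pattern behaves the same way in both), and the names `NAIVE`, `SOURCE` and `naive_minify` are hypothetical.

```python
import re

# The pattern from above with its own whitespace and comments stripped by hand.
# re.MULTILINE makes `$` match at the end of every line.
NAIVE = re.compile(r"(?<!\\)\s|(?<!\\)\#.*\s*$", re.MULTILINE)

SOURCE = r"""# Match a 20th or 21st century date in yyyy-mm-dd format
(19|20)\d\d                # year (group 1)
[- /.]                     # separator - dash, space, slash or period
(0[1-9]|1[012])            # month (group 2)
[- /.]                     # separator - dash, space, slash or period
(0[1-9]|[12][0-9]|3[01])   # day (group 3)
"""


def naive_minify(pattern: str) -> str:
    """Delete every match; nothing here knows about character classes."""
    return NAIVE.sub("", pattern)


if __name__ == "__main__":
    # The space inside each [- /.] is removed along with the layout whitespace.
    print(naive_minify(SOURCE))
```

The first alternative has no notion of being inside `[...]`, so the space after the dash is treated like any other unescaped whitespace.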

How can I change this pattern so that literal space (and #) characters within [ and ] are preserved?

Isn't regex itself a context-sensitive grammar, and hence cannot be parsed by regexes, which can only parse regular grammars? – Sweeper May 22 '19 at 10:19
What do you mean by 'parse'? All I want to do is remove whitespace from the regular expression - I wouldn't call that parsing. – Dan Stevens May 22 '19 at 13:44
But you also want to exclude those white spaces that are in character classes, which requires figuring out where the character classes are, which is parsing the regular expression. – Sweeper May 22 '19 at 14:21
Parsing here means differentiating the meaningful parts of a regular expression from the meaningless parts, which is not possible using a regular expression. – revo May 22 '19 at 15:00
I understand your points now - you're right that regular expressions are context-sensitive; however, that doesn't necessarily mean it's not possible to write a regular expression that matches a context-sensitive grammar. See https://stackoverflow.com/questions/612654 and https://nikic.github.io/2012/06/15/The-true-power-of-regular-expressions.html – Dan Stevens May 22 '19 at 16:01
What if I changed the requirements a little? What if I said that a space or # within a character class must appear immediately after the `[` or `[-` (if a dash is required in the character class)? It would mean the regex couldn't minify every possible regex, but I would at least know that if I write my own free-spaced regexes this way, I can minify them. – Dan Stevens May 22 '19 at 16:05

score 0 · Answer 1 · answered May 22 '19 at 16:57

Maybe this regex can help:

(?:\[(?:[^\\\]]++|\\.)*+\]|\\.)(*SKIP)(*F)|\#.*?$|\s++
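
In the pattern above, the first alternative matches a whole character class or an escaped character and then uses `(*SKIP)(*F)` to discard that match and resume after it, so only comments and whitespace outside those spans are left to be matched. For engines without backtracking control verbs, roughly the same effect can be had with a capture group and a replacement callback. The following is a minimal sketch in Python, which is an assumption (its `re` module has no `(*SKIP)(*F)`); the names `KEEP_OR_DROP` and `minify` are hypothetical, and the possessive quantifiers are replaced by ordinary ones.

```python
import re

# Group 1 captures what must be kept: a whole character class or an
# escaped character. The other alternatives are what gets deleted.
KEEP_OR_DROP = re.compile(
    r"(\[(?:[^\\\]]|\\.)*\]|\\.)"  # keep: [...] or \x
    r"|\#.*?$"                     # drop: comment up to end of line
    r"|\s+",                       # drop: run of whitespace
    re.MULTILINE,
)


def minify(pattern: str) -> str:
    """Remove free-spacing whitespace and comments, re-emitting group 1 as is."""
    return KEEP_OR_DROP.sub(lambda m: m.group(1) or "", pattern)


if __name__ == "__main__":
    print(minify("[- /.]   # separator\n(0[1-9]|1[012])  # month\n"))
```

This is a different mechanism from the control verbs (kept spans are matched and written back rather than skipped), and it inherits the same idea of what counts as a character class.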

Michail

For instance, it fails on a regex like this: `(?:#str)`. If you patch this, you face another problem. – revo May 22 '19 at 17:59
It seems this regex is erroneous. Maybe you mean "(?#str)". I can patch this and other problems (\Q...\E, [:[...]:] and maybe others) very simply, if the original poster needs this. – Michail May 22 '19 at 18:35
I mean it exactly as I wrote it. Patching here doesn't have an end. It takes forever and is always prone to wrong matches. – revo May 22 '19 at 20:40
Ha. Patching has a quick end, as PCRE has a bounded syntax. Again: if the original poster really needs to use more complicated regexes than in his example, then I'll write a sufficiently precise regex. – Michail May 23 '19 at 04:41
I'm not familiar with the `(*SKIP)(*F)` control verbs. What's the support for this like? Is it supported in PHP (specifically `preg_match_all`)? – Dan Stevens May 23 '19 at 12:01
@DanStevens there are many cases that it fails. You can find some of them here https://regex101.com/r/3EVpuH/4 – revo May 23 '19 at 20:36
@revo Some of the suggested failures (e.g. `(?:#str)` and `(#) # comment`) are assuming that the regex author doesn't have to escape their hashes when writing with free-spacing mode enabled. However, I think it's the case that the first hash is always treated as a comment delimiter regardless. Therefore I think it's safe to assume the regex author would have written your examples as `(?:\#str)` and `(\#) # comment`, which don't fail. See bottom of https://regex101.com/r/3EVpuH/7 – Dan Stevens May 24 '19 at 08:59
Although one of your examples, `[^]\[#] # comment`, does look like a valid edge case: the first `#` is matched even though it is technically within a negated character class. Assuming regex101.com's parser is representative, I'd have expected the first `]` to generate an error for being unescaped, but this doesn't appear to be the case. I think this is very unlikely to occur in real use. – Dan Stevens May 24 '19 at 09:08
@DanStevens It's the expected behavior of PCRE and many other regex engines to allow unescaped brackets `][` at the beginning of a character class to represent literal brackets: `[][]`. There are other cases too: `[[:space:]#]`. I think I've warned you enough. It's up to you to decide. – revo May 24 '19 at 09:39
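
The two cases named in the last comment can be checked mechanically. The sketch below is in Python, which is an assumption (its `re` module has no `(*SKIP)(*F)`, so a capture group plus a replacement callback stands in for the verbs). `SIMPLE_CLASS` mirrors the character-class part of the pattern in the answer; `PATCHED_CLASS` is a hypothetical extension that also accepts a leading literal `]` and POSIX classes such as `[:space:]`. It covers only these two cases, not `\Q...\E` or anything else.

```python
import re

# Character class as in the answer: '[' up to the first unescaped ']'.
SIMPLE_CLASS = r"\[(?:[^\\\]]|\\.)*\]"

# Hypothetical patch: optional '^', optional literal ']' in first position,
# then POSIX classes like [:space:], escapes, or any other character.
PATCHED_CLASS = r"\[\^?\]?(?:\[:\^?[a-z]+:\]|\\.|[^\\\]])*\]"


def make_minifier(class_pattern: str):
    """Build a minifier that keeps classes and escapes and drops the rest."""
    regex = re.compile(
        "(" + class_pattern + r"|\\.)|\#.*?$|\s+", re.MULTILINE
    )
    return lambda pattern: regex.sub(lambda m: m.group(1) or "", pattern)


simple = make_minifier(SIMPLE_CLASS)
patched = make_minifier(PATCHED_CLASS)

if __name__ == "__main__":
    for source in (r"[^]\[#] # comment", "[[:space:]#] # comment"):
        print(repr(source), "->", repr(simple(source)), "vs", repr(patched(source)))
```

With the simple class, both inputs are cut off at the first `#`, because the class is considered closed at the first `]`. The patched class keeps them intact, which illustrates both sides of the exchange: each case can be patched, and each patch is one more rule the pattern has to get right.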

Dan Stevens · Answer 2 · 2019-05-23T12:27:49.693

Here's my solution:

# Match any literal whitespace character, except when within a valid character class
# at first position, or second position after `-`
(?<!\\|(?<!\\)\[|(?<!\\)\[-)\s 
|
# Match comments (any text following a literal # until end-of-line), except when
# within a character class at first position, or second position after `-` or third
# position after `- `
(?<!\\|(?<!\\)\[|(?<!\\)\[-|(?<!\\)\[\ |(?<!\\)\[-\ )\#.*$\r?\n?

The result of minifying the pattern with itself is:

(?<!\\|(?<!\\)\[|(?<!\\)\[-)\s|(?<!\\|(?<!\\)\[|(?<!\\)\[-|(?<!\\)\[\ |(?<!\\)\[-\ )\#.*$\r?\n?

https://regex101.com/r/3EVpuH/1

An advantage of this solution is that it doesn't depend on backtracking control verbs (which I'd not heard of until I looked into them after seeing Michail's solution).

The disadvantage (compared with Michail's solution) is that if you want to specify dash, space and/or # characters within a character class, they must appear in a specific order: dash, space, then hash, i.e. [- #]. I don't know if this requirement can be eliminated without using control verbs.
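
The ordering restriction can be seen in a short Python sketch. Python's `re` module is an assumption here, and it requires fixed-width lookbehinds, so each branch of the original lookbehind is written as its own negative lookbehind (rejecting an alternation is the same as rejecting every branch separately). The names `NOT_IN_CLASS_HEAD`, `ORDERED` and `ordered_minify` are hypothetical.

```python
import re

# Negative lookbehinds shared by both branches: not escaped, and not in
# first position of a class or second position after '-'.
NOT_IN_CLASS_HEAD = r"(?<!\\)(?<!(?<!\\)\[)(?<!(?<!\\)\[-)"

ORDERED = re.compile(
    NOT_IN_CLASS_HEAD + r"\s"
    + "|"
    # A '#' is additionally left alone after '[ ' or '[- '.
    + NOT_IN_CLASS_HEAD + r"(?<!(?<!\\)\[\ )(?<!(?<!\\)\[-\ )\#.*$\r?\n?",
    re.MULTILINE,
)


def ordered_minify(pattern: str) -> str:
    """Delete whitespace and comments unless they sit at the head of a class."""
    return ORDERED.sub("", pattern)


if __name__ == "__main__":
    print(ordered_minify("[- #]  # dash, space, hash\n"))  # class survives
    print(ordered_minify("[/ .]  # space is not at the head\n"))  # space is lost
```

A class written as `[- #]` comes through unchanged, while the space in `[/ .]` is removed because it is not in one of the protected positions.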

I don't quite understand why the absence of control verbs is an advantage. In any case, you can rewrite that regex without control verbs: ``(?:\[(?:[^\\\]]++|\\.)*+\]|\\.|[^\#\s])*+\K(?:\#.*?$|\s++)`` – Michail May 23 '19 at 19:06
If you don't want to have unexpected behaviors, never choose this solution. – revo May 23 '19 at 20:38
