What is happening inside this regex alternation expression

Question

The following regular expression works, but can anyone explain how? Any comment is appreciated! Thanks! Quinoa

What is the regex "|" doing to strip the tags "<script>" and "</Script>" from <script>Keep THIS</Script> and get "Keep THIS" into memory $1?

Here is the REGEX:

(?x)
([\w\.!?,\s-])|<.*?>|.

Here is the string:

 <script>Keep THIS</Script>

Results: $1 = "Keep THIS"

Commented below:

  (?x)                     set flags for this block (disregarding
                           whitespace and comments) (case-sensitive)
                           (with ^ and $ matching normally) (with .
                           not matching \n)

  (                        group and capture to \1:
    [\w\.!?,\s-]             any character of: word characters (a-z,
                             A-Z, 0-9, _), '\.', '!', '?', ',',
                             whitespace (\n, \r, \t, \f, and " "), '-'
  )                        end of \1
 |                        OR
  <                        '<'
  .*?                      any character except \n (0 or more times
                           (matching the least amount possible))
  >                        '>'
 |                        OR
  .                        any character except \n

Please *always* show your Perl code. There are many ways to combine the components you have shown. — Borodin, Mar 03 '15 at 01:44

Avinash Raj · Accepted Answer · 2015-03-03T01:18:48.793

1

<.*?> matches each tag, that is, each substring that starts with < and ends with the next >. In the remaining text, ([\w\.!?,\s-]) captures any word character, dot, !, ?, comma, whitespace character or hyphen. Note that it captures only one single character at a time into group 1.

If you want to capture the whole string Keep THIS into group 1, you need to add a + quantifier right after the character class. + repeats the previous token one or more times.

([\w\.!?,\s-]+)|<.*?>|.

Finally, the . matches any remaining character that neither of the other two alternatives matched.
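To see the effect of the + quantifier, here is a minimal sketch. It assumes the regex is applied with a global match in a loop, which the question does not show, and the variable names ($s, $re, @kept) are only for illustration.

```perl
use strict;
use warnings;

my $s = '<script>Keep THIS</Script>';

# Same alternation, but with + after the character class so that a whole
# run of allowed characters is captured in a single match.
my $re = qr/([\w.!?,\s-]+)|<.*?>|./;

my @kept;
while ( $s =~ /$re/g ) {
    # $1 is defined only when the first alternative was the one that matched
    push @kept, $1 if defined $1;
}

print join( '|', @kept ), "\n";    # prints: Keep THIS
```

With the + in place there are only three matches in total: the opening tag, the whole text Keep THIS, and the closing tag. Only the middle one sets $1.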


edited Mar 03 '15 at 01:18

answered Mar 03 '15 at 01:09

Avinash Raj


Thank you Avinash for the prompt response! Are you saying that there is an order of matching? Does the order begin from the right with the period before the last "|"? 1. First match is by '"|."' which is the whole string, 2. Second match are the tags and 3. Third match is the remaining string. – quinoa Mar 03 '15 at 01:13
At first `([\w\.!?,\s-]+)` matches all the word characters, spaces etc. except `<`, `>`. Then, once it sees the pattern `<.*?>`, it matches the tag strings and leaves only the in-between `Keep THIS` string. But I always suggest you write `<.*?>|([\w\.!?,\s-])|.` instead. Finally the . matches all the remaining characters which are not matched. – Avinash Raj Mar 03 '15 at 01:17
Note that the regex engine parses the string from left to right. – Avinash Raj Mar 03 '15 at 01:21
Avinash, I removed the last '"|."' it will not work without it. Thanks! – quinoa Mar 03 '15 at 01:22
Yep, the final `|.` is used to match all the remaining characters, so it's a must. – Avinash Raj Mar 03 '15 at 01:23
my comments treating each case separately: ( group and capture to \1: [\w\.!?,\s-] any character of: word characters (a-z, A-Z, 0-9, _), '\.', '!', '?', ',', whitespace (\n, \r, \t, \f, and " "), '- ' ) end of \1 by itself matches – quinoa Mar 03 '15 at 01:28
| OR < '<' .? any character except \n (optional (matching the most amount possible)) > '>' by itself matches | OR . any character except \n by itself matches Here are – quinoa Mar 03 '15 at 01:31
Note that the `\1` group matches `script`, `Keep THIS`, `Script`, one single character at a time. It won't match `<` or `>` or `/` because we failed to include those inside the character class. So once it sees `<`, it matches that `<` and from there matches all the characters non-greedily up to the next closing `>` symbol. That is, it matches each char in the string `script` a second time. – Avinash Raj Mar 03 '15 at 01:35
Avinash, it's pretty straightforward if we take each case separately, but then when we put them together in the same line it first matches the whole string as a capture group so it can be used in memory as $1, then it matches the '" – quinoa Mar 03 '15 at 01:38
`and then it uses the last period to bring what? ???` See https://regex101.com/r/cF9oW4/3: the last colon was matched by the last `|.` pattern. It is used to match all the other remaining characters. – Avinash Raj Mar 03 '15 at 01:40
I think I finally get it. It first matches the whole string but in parts: first as only characters in first group, secondly as < or > or /. But since the first group is placed in memory $1 we are set, but we still need to match each string a second time. I assume the second time around it shows: '""' right? Thanks! – quinoa Mar 03 '15 at 01:46
Avinash, I take your word for it, the second time around it matches only "Keep THIS" and disregards the second group (I wonder how?). It would be great to look under the hood and see the matching process. Thanks! – quinoa Mar 03 '15 at 01:49
Avinash, Thank you for all your help! Juan – quinoa Mar 03 '15 at 01:55
@quinoa since `<`, `>` are not matched by the first pattern, the second pattern tries to match from `<` to `>`, and this makes the first pattern give up its previous matches. If `<`, `>` were included inside the char class of the first pattern, that wouldn't happen. – Avinash Raj Mar 03 '15 at 01:55
Accept an answer from here which helps you the most. – Avinash Raj Mar 03 '15 at 01:56
Thanks again! How do I vote for you? Thanks for helping me think this through! Respectfully, Quinoa – quinoa Mar 03 '15 at 02:00
I did not receive the option to accept your answer. I am new at this site. Thanks! Quinoa – quinoa Mar 03 '15 at 02:09
press the tick mark below the down arrow which belongs to this answer. – Avinash Raj Mar 03 '15 at 02:14
I give you upvote to nullify the downvote, it's a pity when the downvoter doesn't explain. – Toto Mar 03 '15 at 09:31

score 0 · Answer 2 · answered Mar 03 '15 at 01:42

The only way this does what you say is if you are using a global match in a loop, and don't have use warnings in place as you should.

Here's what I think you have, but using Data::Dump to display the contents of $1 instead of what is presumably print $1 in your own code. (It really helps a lot to show your actual Perl code instead of selected snippets.)

use strict;
use warnings;

use Data::Dump;

my $s = '<script>Keep THIS</Script>';

my $re = qr/(?x)
([\w\.!?,\s-])|<.*?>|./;

while ( $s =~ /$re/g ) {
  dd $1;
}

output

undef
"K"
"e"
"e"
"p"
" "
"T"
"H"
"I"
"S"
undef

The first pass is matching <script>, which isn't captured so $1 is undefined.
Subsequent passes match a single character from the class [\w\.!?,\s-], which consumes the string Keep THIS one character at a time.
Finally, the closing </Script> is matched without capturing, and leaves $1 undefined again.

undef is printed as an empty string, and without warnings enabled you won't be alerted to it.
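If you do want to keep the loop, a minimal sketch of one way to avoid the undefined values is to test defined $1 on each pass. The @log array and its "captured"/"skipped" labels are my own additions for illustration, not something from your code.

```perl
use strict;
use warnings;

my $s  = '<script>Keep THIS</Script>';
my $re = qr/([\w.!?,\s-])|<.*?>|./;

my $text = '';
my @log;
while ( $s =~ /$re/g ) {
    if ( defined $1 ) {
        # First alternative matched: one allowed character was captured
        $text .= $1;
        push @log, "captured '$1'";
    }
    else {
        # Second or third alternative matched; $& holds the whole match
        push @log, "skipped  '$&'";
    }
}

print "$_\n" for @log;
print "result: $text\n";    # prints: result: Keep THIS
```

The log shows the two tags being skipped as whole matches and each character of Keep THIS being captured on its own pass, with no warnings because $1 is only used when it is defined.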

The solution is to always use a proper HTML parser to process HTML. Regular expressions are the wrong tool for the job.
