preg_replace multiline match but preserve new lines

Question

I need a one-liner that strips PHP from an HTML file. The trick is that I also need it to preserve the newlines previously taken up by the PHP lines.

php -r "echo preg_replace('/<\\\\?.*(\\\\?>|\$)/Us','', file_get_contents(\$argv[1]));" -- "./index.php"

This "works" but does not preserve the new lines, for example:

<html><?php test(); ?>
  <head>
    <?php test();

    ?>
  </head>
  <body>
  </body>
<html>

Resolves to:

<html>
  <head>

  </head>
  <body>
  </body>
<html>

But I need it to resolve to:

<html>
  <head>



  </head>
  <body>
  </body>
<html>

Maybe I am using a hammer to drive a screw but what I am trying to do is remove the PHP code, run the result through htmlhint and have the reported line numbers actually match the lines in the file.

If there is a better solution, I would love to hear it. The end goal is to lint files that contain a mix of PHP, JavaScript and HTML with their respective linters.

The only thing I can think of is to loop each line and replace if it's a PHP line or in between PHP tags. — Andreas, Oct 30 '17 at 20:21
Line breaks in an html file are ignored by browsers - so why would it matter? — , Oct 30 '17 at 20:23

ctwheels · Answer 1 · 2017-10-30T22:37:25.287

Brief

Regex is definitely not the best answer for this problem, but since you're looking for an answer in regular expression form, here you have it!

Note: This will break if a comment or string contains <?.

Code

(?:\G(?!\A)|\h*(?=<\?))(.*(?=(?:(?!<\?)[\s\S])*?(?<=\?>)))

Results

Input

<html><?php test(); ?>
  <head>
    <?php test();

    ?>
  </head>
  <body>
  </body>
<html>

Output

<html>
  <head>



  </head>
  <body>
  </body>
<html>

Explanation

(?:\G(?!\A)|\h*(?=<\?)) Match either of the following options
- \G(?!\A)
  - \G Assert position at the end of the previous match or the start of the string for the first match
  - (?!\A) Negative lookahead asserting what follows is not the start of the string (this basically makes \G only match the end of the previous match)
- \h*(?=<\?) Match the following
  - \h* Match any number of horizontal whitespace characters (used to clean up whitespace before <?)
  - (?=<\?) Positive lookahead ensuring the following matches
    - < Match the less than character < literally
    - \? Match the question mark character ? literally
(.*(?=(?:(?!<\?)[\s\S])*?(?<=\?>))) Capture the following into capture group 1
- .* Match any character (except for line terminators) any number of times
- (?=(?:(?!<\?)[\s\S])*?(?<=\?>)) Positive lookahead ensuring what follows matches
  - (?:(?!<\?)[\s\S])*? Match the following any number of times, but as few as possible
    - (?!<\?) Negative lookahead ensuring what follows does not match the following
      - < Match the less than character < literally
      - \? Match the question mark character ? literally
    - [\s\S] Match any character
  - (?<=\?>) Positive lookbehind ensuring what precedes matches the following
    - \? Match the question mark character ? literally
    - > Match the greater than character > literally
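
A minimal sketch of applying the pattern above from PHP. It assumes the replacement is an empty string, which the answer does not state explicitly. The function name `blank_php_with_regex` is hypothetical and the sample input is the one from the question.

```php
<?php
// Sketch only: blank_php_with_regex is a hypothetical helper name, and the
// empty replacement string is an assumption (the answer does not state it).
function blank_php_with_regex(string $source): string
{
    // The pattern from this answer, unchanged, wrapped in / delimiters.
    $pattern = '/(?:\G(?!\A)|\h*(?=<\?))(.*(?=(?:(?!<\?)[\s\S])*?(?<=\?>)))/';

    // Neither .* nor \h can consume a line break, so every newline survives.
    $result = preg_replace($pattern, '', $source);
    if ($result === null) {
        throw new RuntimeException('preg_replace failed, error code ' . preg_last_error());
    }
    return $result;
}

// Sample input taken from the question.
$sample = <<<'SRC'
<html><?php test(); ?>
  <head>
    <?php test();

    ?>
  </head>
  <body>
  </body>
<html>
SRC;

echo blank_php_with_regex($sample), PHP_EOL;
```

Because the consuming parts of the pattern never cross a line break, the output always has the same number of lines as the input, which is what keeps linter line numbers aligned.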

edited Oct 30 '17 at 22:37

answered Oct 30 '17 at 21:13


Sick regex! I know regex pretty well, but where do you learn this 'extreme' type of expressions? :) – silkfire Oct 30 '17 at 22:17
I play on regex101 *a lot*. I've also been answering regex questions for a few months now, each one gives me more knowledge than the last. Also, [Wiktor](https://stackoverflow.com/users/3832970/wiktor-stribi%C5%BCew) is a great help with regular expressions. He usually critiques my answers, but it helps me improve on them. If you flip through some of Wiktor's answers, you'll see a wide variety of regular expressions and for multiple languages. I usually default to PCRE regex (it supports more than most regex flavours). Also, [sln](https://stackoverflow.com/users/557597/sln) has good regexes. – ctwheels Oct 30 '17 at 22:24
@silkfire also, check out this link: https://stackoverflow.com/tags/regex/topusers. It lists the top regex users. Go through and see what sort of answers these users post. It's a great resource to keep if you're going to continue using regex. – ctwheels Oct 30 '17 at 22:27
Thank you, ctwheels, really appreciate the resources you've provided links to :) – silkfire Oct 31 '17 at 09:19
I agree regex is not the best answer but your answer is still badass. Will have to take the time to understand this. Your answer depth is impeccable. I am marking @casimir-et-hippolyte's comment as the answer because I was not specifically looking for regex and it seems much better suited for the use-case, but thank you nonetheless! – dafky2000 Oct 31 '17 at 11:27
@dafky2000 no problem! It might help someone else in the future with a similar problem. I would definitely say casimir's answer is better, but this does provide an alternative entirely in regex. – ctwheels Oct 31 '17 at 13:09

Casimir et Hippolyte · Accepted Answer · 2017-10-31T12:44:05.683

OK, here is a one-liner using the tokenizer (with an ugly trick inside):

php -r 'echo array_reduce(token_get_all(file_get_contents($argv[1])),function($c,$i){return $i[0]==321?$c.$i[1]:$c.str_repeat("\n",@count_chars($i.$i[1])[10]);});'

Advantage of the tokenizer: even a string like "abc <?php echo '?>'; ?> def" is correctly parsed.

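321 is the value of the constant T_INLINE_HTML (everything that isn't between PHP tags) in the PHP version used here. Token values can differ between PHP versions, so writing T_INLINE_HTML instead of 321 is more robust.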

10 is the ASCII code of the newline character (LF). By default, count_chars returns an array with the byte values as keys and the number of occurrences of each byte as values.
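
As a small illustration of that lookup, the sketch below counts the newlines in a made-up string (the variable names and the sample text are not from the answer) and compares the result with `substr_count`, which gives the same number.

```php
<?php
// Illustrative sample text; it contains three LF characters.
$text = "first\nsecond\n\nfourth";

// Default mode (0): an array indexed by byte value 0..255 holding frequencies.
$counts = count_chars($text);

// ord("\n") is 10, so index 10 holds the number of newlines.
$newlines = $counts[ord("\n")];

echo $newlines, PHP_EOL;                 // prints 3
echo substr_count($text, "\n"), PHP_EOL; // prints 3 as well
```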

The ugly part is $i.$i[1]. token_get_all returns each token either as an array (token ID, text, line number) or, for single-character tokens, as a plain string, so this expression concatenates either an array with a string or a string with an undefined offset. The @ suppresses the resulting warnings and notices. The trick avoids an explicit type test, and the number of newline characters is still preserved.
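
For readability, the same idea can be sketched without the concatenation trick. This is an expanded variant, not the answer's exact code: the function name `blank_php_with_tokenizer` is hypothetical, it uses the T_INLINE_HTML constant instead of 321, and it tests the token type explicitly with `is_array`.

```php
<?php
// Sketch only: blank_php_with_tokenizer is a hypothetical helper name.
function blank_php_with_tokenizer(string $source): string
{
    $out = '';
    foreach (token_get_all($source) as $token) {
        // Single-character tokens such as ";" arrive as plain strings and
        // cannot contain a newline, so they contribute nothing.
        if (!is_array($token)) {
            continue;
        }
        [$id, $text] = $token;
        if ($id === T_INLINE_HTML) {
            // Everything outside PHP tags is copied through unchanged.
            $out .= $text;
        } else {
            // PHP code is dropped, but its line breaks are kept.
            $out .= str_repeat("\n", substr_count($text, "\n"));
        }
    }
    return $out;
}

// A closing tag inside a PHP string does not confuse the tokenizer.
echo blank_php_with_tokenizer("abc <?php echo '?>'; ?> def"), PHP_EOL;
```

Unlike the regex approach, whitespace that precedes an opening tag on the same line is part of the inline HTML and is therefore kept.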

Or with DOMDocument:

php -r '$d=new DOMDocument;$d->loadHTMLFile($argv[1],8196);foreach((new DOMXPath($d))->query("//processing-instruction()")as$p)$p->parentNode->replaceChild($d->createTextNode(preg_replace("~\S+~","",$p->nodeValue)),$p);echo$d->saveHTML();'

THIS. Infinitely more elegant! On one line it doesn't look nice but functionality-wise - much better! — dafky2000, Oct 31 '17 at 11:30
Also use `-d short_open_tag=On` to parse short tags: `php -d short_open_tag=On -r 'echo array_reduce(token_get_all(file_get_contents($argv[1])),function($c,$i){return $i[0]==321?$c.$i[1]:$c.str_repeat("\n",@count_chars($i.$i[1])[10]);});'` — dafky2000, Oct 31 '17 at 11:41