I need a one-liner that strips PHP from an HTML file. The trick is that it also needs to preserve the newlines previously taken up by the PHP lines.

php -r "echo preg_replace('/<\\\\?.*(\\\\?>|\$)/Us','', file_get_contents(\$argv[1]));" -- "./index.php"

This "works" but does not preserve the newlines. For example:

<html><?php test(); ?>
  <head>
    <?php test();

    ?>
  </head>
  <body>
  </body>
<html>

Resolves to:

<html>
  <head>

  </head>
  <body>
  </body>
<html>

But I need it to resolve to:

<html>
  <head>



  </head>
  <body>
  </body>
<html>

Maybe I am using a hammer to drive a screw but what I am trying to do is remove the PHP code, run the result through htmlhint and have the reported line numbers actually match the lines in the file.

If there is a better solution, I would love to hear it. The end goal is to lint files that contain a mix of PHP, JavaScript and HTML, each with its respective linter.

dafky2000
  • The only thing I can think of is to loop each line and replace if it's a PHP line or in between PHP tags. – Andreas Oct 30 '17 at 20:21
  • Line breaks in an HTML file are ignored by browsers - so why would it matter? –  Oct 30 '17 at 20:23

2 Answers

Brief

Regex is definitely not the best answer for this problem, but since you're looking for an answer in regular expression form, here you have it!

Note: This will break if a comment or string contains <?.


Code

(?:\G(?!\A)|\h*(?=<\?))(.*(?=(?:(?!<\?)[\s\S])*?(?<=\?>)))

Results

Input

<html><?php test(); ?>
  <head>
    <?php test();

    ?>
  </head>
  <body>
  </body>
<html>

Output

<html>
  <head>



  </head>
  <body>
  </body>
<html>

Explanation

  • (?:\G(?!\A)|\h*(?=<\?)) Match either of the following options
    • \G(?!\A)
      • \G Assert position at the end of the previous match or the start of the string for the first match
      • (?!\A) Negative lookahead asserting what follows is not the start of the string (this basically makes \G only match the end of the previous match)
    • \h*(?=<\?) Match the following
      • \h* Match any number of horizontal whitespace characters (used to clean up whitespace before <?)
      • (?=<\?) Positive lookahead ensuring the following matches
        • < Match the less than character < literally
        • \? Match the question mark character ? literally
  • (.*(?=(?:(?!<\?)[\s\S])*?(?<=\?>))) Capture the following into capture group 1
    • .* Match any character (except for line terminators) any number of times
    • (?=(?:(?!<\?)[\s\S])*?(?<=\?>)) Positive lookahead ensuring what follows matches
      • (?:(?!<\?)[\s\S])*? Match the following any number of times, but as few as possible
        • (?!<\?) Negative lookahead ensuring what follows is not matched
          • < Match the less than character < literally
          • \? Match the question mark character ? literally
        • [\s\S] Match any character
      • (?<=\?>) Positive lookbehind ensuring what precedes matches the following
        • \? Match the question mark character ? literally
        • > Match the greater than character > literally
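For anyone who wants to try the pattern from PHP, here is a minimal sketch. It assumes every match is simply replaced with an empty string via preg_replace (the answer itself only gives the pattern), and the variable names are purely illustrative.

```php
<?php
// Pattern copied verbatim from the answer above.
// Assumption: each match is replaced with an empty string.
$pattern = '/(?:\G(?!\A)|\h*(?=<\?))(.*(?=(?:(?!<\?)[\s\S])*?(?<=\?>)))/';

// Sample input from the question.
$input = "<html><?php test(); ?>\n"
       . "  <head>\n"
       . "    <?php test();\n"
       . "\n"
       . "    ?>\n"
       . "  </head>\n"
       . "  <body>\n"
       . "  </body>\n"
       . "<html>\n";

$output = preg_replace($pattern, '', $input);

echo $output;

// The newline count should be unchanged, so linter line numbers still line up.
printf(
    "newlines before: %d, after: %d\n",
    substr_count($input, "\n"),
    substr_count($output, "\n")
);
```

Only the text on PHP lines is removed; the line terminators are never part of a match because . does not match them.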
ctwheels
  • Sick regex! I know regex pretty well, but where do you learn this 'extreme' type of expressions? :) – silkfire Oct 30 '17 at 22:17
  • I play on regex101 *a lot*. I've also been answering regex questions for a few months now, each one gives me more knowledge than the last. Also, [Wiktor](https://stackoverflow.com/users/3832970/wiktor-stribi%C5%BCew) is a great help with regular expressions. He usually critiques my answers, but it helps me improve on them. If you flip through some of Wiktor's answers, you'll see a wide variety of regular expressions and for multiple languages. I usually default to PCRE regex (it supports more than most regex flavours). Also, [sln](https://stackoverflow.com/users/557597/sln) has good regexes. – ctwheels Oct 30 '17 at 22:24
  • @silkfire also, check out this link: https://stackoverflow.com/tags/regex/topusers. It lists the top regex users. Go through and see what sort of answers these users post. It's a great resource to keep if you're going to continue using regex. – ctwheels Oct 30 '17 at 22:27
  • Thank you, ctwheels, really appreciate the resources you've provided links to :) – silkfire Oct 31 '17 at 09:19
  • I agree regex is not the best answer but your answer is still badass. Will have to take the time to understand this. Your answer depth is impeccable. I am marking @casimir-et-hippolyte's answer as accepted because I was not specifically looking for regex and it seems much better suited for the use case, but thank you nonetheless! – dafky2000 Oct 31 '17 at 11:27
  • @dafky2000 no problem! It might help someone else in the future with a similar problem. I would definitely say casimir's answer is better, but this does provide an alternative entirely in regex. – ctwheels Oct 31 '17 at 13:09

OK, a one-liner using the tokenizer (with an ugly trick inside):

php -r 'echo array_reduce(token_get_all(file_get_contents($argv[1])),function($c,$i){return $i[0]==321?$c.$i[1]:$c.str_repeat("\n",@count_chars($i.$i[1])[10]);});'

Advantage of the tokenizer: even a string like "abc <?php echo '?>'; ?> def" is correctly parsed.
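A small sketch of that claim, collecting only the inline HTML tokens from the example string ($source and $html are illustrative names, not part of the one-liner):

```php
<?php
// The '?>' inside the PHP string literal must not be treated as a closing tag.
$source = "abc <?php echo '?>'; ?> def";

$html = [];
foreach (token_get_all($source) as $token) {
    // Array tokens are [id, text, line]; single-character tokens are plain strings.
    if (is_array($token) && $token[0] === T_INLINE_HTML) {
        $html[] = $token[1];
    }
}

var_export($html); // the two HTML fragments around the PHP block
```

A naive regex that stops at the first ?> would instead leave '; ?> def behind.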

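321 is the value of the constant T_INLINE_HTML (everything that isn't between PHP tags). Token values can change between PHP versions, so writing T_INLINE_HTML instead of 321 is more robust.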

10 is the ASCII code for the newline character (LF). By default, count_chars returns an array with the byte values as keys and the number of occurrences as values.

The ugly thing is $i.$i[1]. token_get_all returns each token either as an array (ID, text, line number) or as a single-character string, so this expression concatenates an array with a string in the first case, or a string with an undefined offset in the second. @ suppresses the resulting warnings and notices. The trick avoids an is_array test, and the number of newline characters is still preserved.
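If the one-line constraint is relaxed, the same idea can be written without the @ trick. This is only a sketch of an equivalent approach: the function name strip_php_keep_lines is made up for illustration, and it uses substr_count rather than count_chars to count the newlines.

```php
<?php
/**
 * Remove PHP code from a mixed PHP/HTML source while keeping every newline,
 * so that line numbers in the result match the original file.
 * (Hypothetical helper name; not part of the one-liner above.)
 */
function strip_php_keep_lines(string $source): string
{
    $out = '';
    foreach (token_get_all($source) as $token) {
        if (!is_array($token)) {
            // Single-character tokens such as ";" or "(" never contain a newline.
            continue;
        }
        [$id, $text] = $token;
        if ($id === T_INLINE_HTML) {
            // Everything outside PHP tags is kept as-is.
            $out .= $text;
        } else {
            // PHP tokens are dropped, but their newlines are kept.
            $out .= str_repeat("\n", substr_count($text, "\n"));
        }
    }
    return $out;
}

// Usage: php strip.php ./index.php
if (isset($argv[1])) {
    echo strip_php_keep_lines(file_get_contents($argv[1]));
}
```

Unlike the regex version, whitespace that precedes an opening tag on the same line is inline HTML and stays in the output, which should not affect the line numbers.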


Or with DOMDocument:

php -r '$d=new DOMDocument;$d->loadHTMLFile($argv[1],8196);foreach((new DOMXPath($d))->query("//processing-instruction()")as$p)$p->parentNode->replaceChild($d->createTextNode(preg_replace("~\S+~","",$p->nodeValue)),$p);echo$d->saveHTML();'
Casimir et Hippolyte
  • THIS. Infinitely more elegant! On one line it doesn't look nice but functionality-wise - much better! – dafky2000 Oct 31 '17 at 11:30
  • Also use `-d short_open_tag=On` to parse short tags: `php -d short_open_tag=On -r 'echo array_reduce(token_get_all(file_get_contents($argv[1])),function($c,$i){return $i[0]==321?$c.$i[1]:$c.str_repeat("\n",@count_chars($i.$i[1])[10]);});'` – dafky2000 Oct 31 '17 at 11:41