I am trying to build a function that removes all elements with empty content or without attributes from an HTML document, and I want to ignore <br> and <hr> tags.

This is my current code.

<?php

function clean($html){
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    while (($node_list = $xpath->query('//*[not(*) and not(@*) and not(text()[normalize-space()])]')) && $node_list->length) {
        foreach ($node_list as $node) {
            $node->parentNode->removeChild($node);
        }
    }

    return $dom->saveHTML();

}

How can I select all such nodes while ignoring br and hr in the XPath query?
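One possible way to express that exclusion, offered as a sketch rather than a definitive answer, is to add a `not(self::br or self::hr)` test to the existing predicate. The function name `clean_keep_breaks` is hypothetical, and the libxml error suppression is an added assumption for imperfect markup; the rest mirrors the code above. Like the original query, it still skips any element that carries attributes, so an empty element with a `style` attribute would not match.

```php
<?php

// Sketch: remove elements that have no child elements, no attributes and no
// non-whitespace text, but never treat <br> or <hr> as removable.
// The name clean_keep_breaks is hypothetical.
function clean_keep_breaks(string $html): string
{
    $dom = new DOMDocument();

    // Assumption: collect libxml parse warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $query = '//*[not(self::br or self::hr)'
           . ' and not(*) and not(@*) and not(text()[normalize-space()])]';

    // Repeat until a pass finds nothing, so that parents emptied by a
    // removal are picked up on the next pass.
    while (($nodes = $xpath->query($query)) && $nodes->length) {
        foreach ($nodes as $node) {
            $node->parentNode->removeChild($node);
        }
    }

    return $dom->saveHTML();
}
```

Because `<br>` and `<hr>` never match, every pass removes at least one node from the document, so the `while` loop ends once no empty element is left.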

Sample input:

<!DOCTYPE html>
<html>
<head>
    <title></title>
</head>
<body>
    <p><strong><span style="font-size:16px;">Specifications:</span></strong><br>
    <span style="font-size:16px;">1. Material: Polymer</span><br>
    <span style="font-size:16px;">2. Weight: 7.94oz / 225g</span><br>
    <span style="font-size:16px;">3. Color: <strong><span style="color:#ff0000;background-color:#ffffff;">Black / White / Blue /</span></strong> <strong><span style="color:#ff0000;background-color:#ffffff;">Red</span></strong></span><br>
    <span style="font-size:16px;">4. Dimensions: (7.48 x 7.87 x 3.35)" / (19 x 20 x 8.5)cm (L x W x H)</span><br>
    <span style="font-size:16px;">5. Driver Unit: 2 Speakers(57mm)</span><br>
    <span style="font-size:16px;">6. Transducer Type: Dynamic</span><br>
    <span style="font-size:16px;">7. Bluetooth Version: V4.1</span><br>
    <span style="font-size:16px;">8. Operating Distance: up to 10m(Free Space)</span><br>
    <span style="font-size:16px;">9. Profiles: A2DP,AVRCP,HSP,HFP</span><br>
    <span style="font-size:16px;">10. Speaker Impendence: 16Ω</span><br>
    <span style="font-size:16px;">11. Frequency Response: 20Hz-20KHz</span><br>
    <span style="font-size:16px;">12. Sensitivity: 110dB</span><br>
    <span style="font-size:16px;">13. THD: &lt;0.1%</span><br>
    <span style="font-size:16px;">14. <span style="color:#6600ff;">Support Micro SD Card Capacity</span>: up to 32GB</span><br>
    <span style="font-size:16px;">15. Micro SD Card Play Time: about 15 hours</span><br>
    <span style="font-size:16px;">16. FM Play Time: about 15 hours</span><br>
    <span style="font-size:16px;">17. Bluetooth Music Time: about 40 hours</span><br>
    <span style="font-size:16px;">18. Talk Time: about 45 hours</span><br>
    <span style="font-size:16px;">19. Standby Time: 1620 hours(around 50Days)</span><br>
    <span style="font-size:16px;">20. Fully Charged Time: about 2 hours</span><br>
    <span style="font-size:16px;">21. Operating Environment: -10~50℃</span></p><span style="font-size:16px;"></span>
    <ul>
        <li><span style="font-size:16px;"></span></li>
        <li><span style="font-size:16px;"></span></li>
        <li><span style="font-size:16px;"></span></li>
    </ul>
    <p></p>
</body>
</html>

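For some unknown reason, the input ends with elements that have no content (the empty <span>, the <ul> whose <li> items hold only empty <span> elements, and the empty <p>), which I want to remove while keeping the line breaks.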

Current output example:

<html>
<head>
    <title></title>
</head>
<body>
    <p><strong><span style="font-size:16px;">Specifications:</span></strong> <span style="font-size:16px;">1. Material: Polymer</span> <span style="font-size:16px;">2. Weight: 7.94oz / 225g</span> <span style="font-size:16px;">3. Color: <strong><span style="color:#ff0000;background-color:#ffffff;">Black / White / Blue /</span></strong> <strong><span style="color:#ff0000;background-color:#ffffff;">Red</span></strong></span> <span style="font-size:16px;">4. Dimensions: (7.48 x 7.87 x 3.35)" / (19 x 20 x 8.5)cm (L x W x H)</span> <span style="font-size:16px;">5. Driver Unit: 2 Speakers(57mm)</span> <span style="font-size:16px;">6. Transducer Type: Dynamic</span> <span style="font-size:16px;">7. Bluetooth Version: V4.1</span> <span style="font-size:16px;">8. Operating Distance: up to 10m(Free Space)</span> <span style="font-size:16px;">9. Profiles: A2DP,AVRCP,HSP,HFP</span> <span style="font-size:16px;">10. Speaker Impendence: 16Ω</span> <span style="font-size:16px;">11. Frequency Response: 20Hz-20KHz</span> <span style="font-size:16px;">12. Sensitivity: 110dB</span> <span style="font-size:16px;">13. THD: &lt;0.1%</span> <span style="font-size:16px;">14. <span style="color:#6600ff;">Support Micro SD Card Capacity</span>: up to 32GB</span><span style="font-size:16px;">15. Micro SD Card Play Time: about 15 hours</span> <span style="font-size:16px;">16. FM Play Time: about 15 hours</span> <span style="font-size:16px;">17. Bluetooth Music Time: about 40 hours</span> <span style="font-size:16px;">18. Talk Time: about 45 hours</span><span style="font-size:16px;">19. Standby Time: 1620 hours(around 50Days)</span><span style="font-size:16px;">20. Fully Charged Time: about 2 hours</span><span style="font-size:16px;">21. Operating Environment: -10~50℃</span></p>
</body>
</html>

Desired output example:

<html>
<head>
    <title></title>
</head>
<body>
    <p><strong><span style="font-size:16px;">Specifications:</span></strong><br>
    <span style="font-size:16px;">1. Material: Polymer</span><br>
    <span style="font-size:16px;">2. Weight: 7.94oz / 225g</span><br>
    <span style="font-size:16px;">3. Color: <strong><span style="color:#ff0000;background-color:#ffffff;">Black / White / Blue /</span></strong> <strong><span style="color:#ff0000;background-color:#ffffff;">Red</span></strong></span><br>
    <span style="font-size:16px;">4. Dimensions: (7.48 x 7.87 x 3.35)" / (19 x 20 x 8.5)cm (L x W x H)</span><br>
    <span style="font-size:16px;">5. Driver Unit: 2 Speakers(57mm)</span><br>
    <span style="font-size:16px;">6. Transducer Type: Dynamic</span><br>
    <span style="font-size:16px;">7. Bluetooth Version: V4.1</span><br>
    <span style="font-size:16px;">8. Operating Distance: up to 10m(Free Space)</span><br>
    <span style="font-size:16px;">9. Profiles: A2DP,AVRCP,HSP,HFP</span><br>
    <span style="font-size:16px;">10. Speaker Impendence: 16Ω</span><br>
    <span style="font-size:16px;">11. Frequency Response: 20Hz-20KHz</span><br>
    <span style="font-size:16px;">12. Sensitivity: 110dB</span><br>
    <span style="font-size:16px;">13. THD: &lt;0.1%</span><br>
    <span style="font-size:16px;">14. <span style="color:#6600ff;">Support Micro SD Card Capacity</span>: up to 32GB</span><br>
    <span style="font-size:16px;">15. Micro SD Card Play Time: about 15 hours</span><br>
    <span style="font-size:16px;">16. FM Play Time: about 15 hours</span><br>
    <span style="font-size:16px;">17. Bluetooth Music Time: about 40 hours</span><br>
    <span style="font-size:16px;">18. Talk Time: about 45 hours</span><br>
    <span style="font-size:16px;">19. Standby Time: 1620 hours(around 50Days)</span><br>
    <span style="font-size:16px;">20. Fully Charged Time: about 2 hours</span><br>
    <span style="font-size:16px;">21. Operating Environment: -10~50℃</span></p>
</body>
</html>

Thank you in advance.

Doralb5
  • @Nick this will get stuck in an infinite loop because of the while loop; I have already tried it. If you use a foreach instead and removing an empty node leaves its parent empty again, that would not achieve what I am trying to do. – Doralb5 Apr 04 '20 at 02:42
  • @Doralb5 as part of a foreach it shouldn't get stuck. But it seems I misunderstood your question. Please expand your question as mickmackusa requests. – Nick Apr 04 '20 at 02:44
  • @Nick as mickmackusa suggested, I added two examples. Let's assume that for some reason you have a list whose items contain only an empty element. A single pass removes only the innermost element, leaving its parent as an empty node again. – Doralb5 Apr 04 '20 at 02:52
  • @mickmackusa how about now? There is the input example, the current output and the desired output. I want to remove the elements without content while keeping the line breaks. In simple words, I want to clean the HTML while keeping it readable. I also have other functions that clean scripts and styles, but they are not relevant to the issue I am facing. The question is simple: what query must I use to select empty nodes while ignoring line breaks and horizontal rules, or even other tags later on? – Doralb5 Apr 04 '20 at 03:16
  • So, you want to recursively remove elements that have no content or tags inside of them. The empty `<span>` tags are removed, which in turn demands that the `<li>` parent tags are removed, which in turn means that the `<ul>` tag has no children and should also be removed. Furthermore, the `<p>` tag has no content or tags, so it should be removed. – mickmackusa Apr 04 '20 at 03:31
  • @mickmackusa yes, but my current code also removes the `<br>` tags, which I don't want to happen. That's why I want to ignore the `<br>` tags in the query. The line `21. Operating Environment: -10~50℃` won't be eliminated because it has text inside, which means its parent element won't be empty and won't be eliminated either. – Doralb5 Apr 04 '20 at 04:48
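The comments mention possibly ignoring further tags later on. A minimal sketch of one way to keep that list configurable is to build the XPath expression from an array of tag names. The helper name `empty_element_query` and the tag-name check are assumptions for illustration, not part of the original code.

```php
<?php

// Hypothetical helper: build the "empty element" query from a list of
// tag names that must never be selected for removal.
function empty_element_query(array $keepTags): string
{
    // Same three tests as the original query.
    $conditions = ['not(*)', 'not(@*)', 'not(text()[normalize-space()])'];

    foreach ($keepTags as $tag) {
        // Accept only plain lowercase tag names so the expression stays valid.
        if (!preg_match('/^[a-z][a-z0-9]*$/', $tag)) {
            throw new InvalidArgumentException("Unexpected tag name: $tag");
        }
        $conditions[] = "not(self::$tag)";
    }

    return '//*[' . implode(' and ', $conditions) . ']';
}
```

Calling `empty_element_query(['br', 'hr'])` yields an expression that can be passed to `DOMXPath::query()` in place of the hard-coded string, and adding a tag such as `img` later only means extending the array.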