3

Is it possible to access an element whose class name contains a Unicode character?

I'm accessing this site, but its class names are prefixed with the Unicode character U+1F41D HONEYBEE (🐝):

$html = file_get_contents('https://www.honestbee.my/en/groceries/stores/bens-independent-grocer/products/720365');
$doc = new \DOMDocument();
$doc->loadHTML($html);

$xpath = new \DOMXpath($doc);

$elements = $xpath->query("//[@class='🐝ap0']");
if (!is_null($elements)) {
    foreach ($elements as $element) {
        echo "<br/>[". $element->nodeName. "]";

        $nodes = $element->childNodes;
        foreach ($nodes as $node) {
            echo $node->nodeValue. "\n";
        }
    }
}

Unfortunately, it throws an error:

ErrorException  : DOMXPath::query(): Invalid expression                                                                                                     
 at /paht/to/test-dom.php:83                                                                        
   79|         $doc->loadHTML($html);                                       
   80|                                                                      
   81|         $xpath = new \DOMXpath($doc);                                
   82|                                                                      
 > 83|         $elements = $xpath->query("//[@class='🐝ap0']");
   84|         if (!is_null($elements)) {                                   
   85|             foreach ($elements as $element) {                        
   86|                 echo "<br/>[". $element->nodeName. "]";              
   87|                                                                      

Exception trace:

1   DOMXPath::query("//[@class='🐝ap0']")
    /paht/to/test-dom.php:83

I looked up the emoji's code and also tried \uD83Dap0, but that doesn't work either.

miken32
Js Lim
  • Have you tried `"//[@class='🐝ap0']"`? Not sure where you got D83D, which is a different character. – miken32 Apr 11 '19 at 03:45
  • Tried a few different things, nothing seems to work. Closest I got was `$elements = $xpath->query("//*[@class[contains(., 'ap0')]]");` – miken32 Apr 11 '19 at 04:45
  • @miken32 Thanks, but `contains` can't guarantee that the query targets the correct element. – Js Lim Apr 11 '19 at 09:26

3 Answers

3

Well, I went down a rabbit hole of character encodings and whatnot before trying $doc->saveHTML() and noticing that all the Unicode characters were corrupted. My guess is that DOMDocument::loadHTML treats everything as ISO-8859-1, which was the default encoding for HTML 4. By prepending an XML prologue that declares the encoding, we can trick it into parsing the input as UTF-8. This lets you search by class name, no matter what characters it uses:

<?php
$html = file_get_contents('https://www.honestbee.my/en/groceries/stores/bens-independent-grocer/products/720365');
// Prepend an encoding declaration so the parser reads the markup as UTF-8.
$prologue = '<?xml encoding="UTF-8">';
$doc = new \DOMDocument();
$doc->loadHTML($prologue . $html);
$xpath = new \DOMXPath($doc);
// The query names an element (div); use * to match any element.
$elements = $xpath->query("//div[@class='🐝ap0']");
foreach ($elements as $element) {
    echo "<br/>[". $element->nodeName. "]";
    $nodes = $element->childNodes;
    foreach ($nodes as $node) {
        echo $node->nodeValue. " \n";
    }
}

It's also worth noting that your "invalid expression" error was not due to the bee, but rather because your query had no element name. In my answer I used div; if you want to search all elements, you can use * instead.
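As a minimal sketch of the two points above (the UTF-8 prologue and the * wildcard), the snippet below runs against a small inline document rather than the live page. The sample markup, the variable names, and the token-matching expression are assumptions added for illustration; they are not taken from the site or from the question. The "\u{1F41D}" escape needs PHP 7.0 or later and produces the same bytes as typing the bee literally.

```php
<?php
// Build the class name from the code point U+1F41D (HONEYBEE).
$bee   = "\u{1F41D}";
$class = $bee . 'ap0';

// Hypothetical stand-in for the downloaded page.
$html = '<html><body>'
      . '<div class="' . $class . '"><span>Price</span></div>'
      . '<p class="' . $class . ' extra">Other</p>'
      . '</body></html>';

// Collect parser warnings internally instead of emitting them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
// The prologue makes the parser treat the input as UTF-8.
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Exact match on the whole class attribute, for any element name.
$exact = $xpath->query("//*[@class='{$class}']");

// Token match: also finds elements that carry additional classes,
// without the false positives of a bare contains().
$token = $xpath->query(
    "//*[contains(concat(' ', normalize-space(@class), ' '), ' {$class} ')]"
);

foreach ($token as $element) {
    echo $element->nodeName, ': ', $element->textContent, "\n";
}
```

The exact comparison only matches when the attribute holds nothing but that one class, while the token form pads the attribute with spaces so the class is matched as a whole word. That may address the concern in the comments that a plain contains() is too loose.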

miken32
0

Actually, I'm using Rct567/DomQuery, and its author has already fixed the issue.

For anyone facing the same problem, I recommend using this package.

Js Lim
0

One workaround is to replace the specific, known Unicode attribute value with an ASCII string. Do this on the fly, just before executing the XPath query.

Example: $html = preg_replace("/🐝ap0/u", 'Beeap0123456', $html);

Alternatively, the str_replace function can replace an array of Unicode attribute values with a mapped array of ASCII ones.

Then the XPath query expression would be a straightforward ASCII one: '//*[@class="Beeap0123456"]'

(Adding a unique string to the replacement ASCII string might reduce the chance of confusion when the document contains other similar attributes.)
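The str_replace variant could look roughly like the sketch below. The sample markup, the second class name (zz9), and the replacement strings are hypothetical and chosen only for illustration; "\u{1F41D}" requires PHP 7.0 or later.

```php
<?php
$bee = "\u{1F41D}"; // U+1F41D HONEYBEE

// Hypothetical sample markup standing in for the fetched page.
$html = '<html><body>'
      . '<div class="' . $bee . 'ap0">A</div>'
      . '<div class="' . $bee . 'zz9">B</div>'
      . '</body></html>';

// Map each known Unicode class name to a unique ASCII stand-in.
$map = [
    $bee . 'ap0' => 'Beeap0123456',
    $bee . 'zz9' => 'Beezz9123456',
];

// str_replace accepts parallel arrays of search and replacement strings.
$ascii = str_replace(array_keys($map), array_values($map), $html);

$doc = new DOMDocument();
$doc->loadHTML($ascii); // the markup is pure ASCII now
$xpath = new DOMXPath($doc);

$hits = $xpath->query('//*[@class="Beeap0123456"]');
foreach ($hits as $element) {
    echo $element->nodeName, ': ', $element->textContent, "\n";
}
```

Note that str_replace rewrites every occurrence in the document, including visible text, so this only suits cases where the original strings are not needed afterwards.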

saeng