
I have a page that contains several hyperlinks. The ones I want to get are of the format:

<html>
<body>

<div id="diva">
<a href="/123" >text2</a>
</div>

<div id="divb">
<a href="/345" >text1</a>
<a href="/678" >text2</a>
</div>

</body>
</html>

I want to extract the three hrefs: 123, 345, and 678.

I know how to get all the hyperlinks using `$gm = $xpath->query("//a")` and then loop through them to get the href attribute.

Is there some sort of regexp to get only the attributes with the above format (i.e. "/digits")?

Thanks

har07
fractal5

1 Answer


XPath 1.0, which is the version supported by DOMXPath, has no regex functions. However, if you need one, you can register a PHP function (built-in or your own) that runs the regex and call it from DOMXPath, as mentioned in this other answer.

There is an XPath 1.0 way to test whether a value is a number. You can apply it to the part of the href value after the / character to check whether the attribute follows the pattern /digits:

//a[number(substring-after(@href,'/')) = substring-after(@href,'/')]
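One caveat with the numeric test: by the XPath 1.0 number rules it also accepts values such as /1.5 or /-7, and because substring-after() takes everything after the first slash, a relative href like about/9 would pass too. If that matters, a stricter check can be sketched with translate(), which removes every digit and compares what is left with the empty string. The sample markup below, including the near-miss links, is made up for illustration:

```php
<?php
// Hypothetical sample markup: the three links from the question plus
// near-misses added here for illustration.
$markup = <<<'XML'
<body>
<a href="/123">text2</a>
<a href="/345">text1</a>
<a href="/678">text2</a>
<a href="/1.5">decimal</a>
<a href="/-7">negative</a>
<a href="/">root</a>
<a href="/12a">mixed</a>
<a href="about/9">relative</a>
</body>
XML;

$doc = new DOMDocument;
$doc->loadXML($markup);
$xpath = new DOMXPath($doc);

// Keep an href only if it starts with "/", has at least one more character,
// and nothing is left once every digit is removed from the rest.
$strict = "//a[starts-with(@href, '/') and string-length(@href) > 1"
        . " and translate(substring(@href, 2), '0123456789', '') = '']";

$matches = [];
foreach ($xpath->query($strict) as $a) {
    $matches[] = $a->getAttribute('href');
}

echo implode("\n", $matches), "\n";
```

With this sample input, only /123, /345, and /678 satisfy all three conditions.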

UPDATE :

For the sake of completeness, here is a working example of calling the PHP function preg_match from DOMXPath::query() to accomplish the same task:

$raw_data = <<<XML
<html>
<body>

<div id="diva">
<a href="/123" >text2</a>
</div>

<div id="divb">
<a href="/345" >text1</a>
<a href="/678" >text2</a>
</div>

</body>
</html>
XML;
$doc = new DOMDocument;
$doc->loadXML($raw_data);

$xpath = new DOMXPath($doc);

$xpath->registerNamespace("php", "http://php.net/xpath");
$xpath->registerPHPFunctions("preg_match");

// php:function's parameters below are :
// parameter 1: PHP function name
// parameter 2: PHP function's 1st parameter, the pattern
// parameter 3: PHP function's 2nd parameter, the string
$gm = $xpath->query("//a[php:function('preg_match', '~^/\d+$~', string(@href))]");

foreach ($gm as $a) {
    echo $a->getAttribute("href") . "\n";
}
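
The same mechanism also works with a user-defined function, which is what the first paragraph of this answer alludes to. The following is only a sketch: href_digits is a hypothetical helper name, and the inline markup is a shortened stand-in for the page in the question. It uses php:functionString, which passes the string value of @href to the PHP function instead of an array of nodes, and it returns the bare digits so that the leading slash is dropped:

```php
<?php
// Hypothetical helper, not part of the original answer: returns the digits
// after the leading slash, or an empty string when the href does not match.
function href_digits($href)
{
    return preg_match('~^/(\d+)\z~', $href, $m) === 1 ? $m[1] : '';
}

$doc = new DOMDocument;
$doc->loadXML('<body><a href="/123">x</a><a href="/about">y</a><a href="/678">z</a></body>');

$xpath = new DOMXPath($doc);
$xpath->registerNamespace('php', 'http://php.net/xpath');
$xpath->registerPHPFunctions('href_digits');

// A non-empty return value means the href has the form /digits.
$links = $xpath->query("//a[php:functionString('href_digits', @href) != '']");

$ids = [];
foreach ($links as $a) {
    $ids[] = href_digits($a->getAttribute('href'));
}

print_r($ids);
```

Calling the helper a second time inside the loop keeps the example short; in real code you might prefer to select all links once and filter them in PHP.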
har07
  • +1; too bad `//a[matches(@href, '^/\d+$')]` isn't supported. – Josh Crozier Feb 21 '16 at 06:15
  • Perfect answer. Thank you. Is XPath 2.0 not supported in PHP? – fractal5 Feb 21 '16 at 06:32
  • @fractal5 Not by core PHP. I don't use PHP regularly, so there may be a library that provides XPath 2.0 support, but I'm not sure. Your best bet might be to call PHP functions like `preg_match` or your own PHP function. Example provided in the **UPDATE** section. – har07 Feb 21 '16 at 06:43