php, strpos extract digit from string

Question

I have a large HTML document to scan. Until now I have been using preg_match_all to extract the parts I need, but from the start it has been extremely CPU-intensive, so we finally decided to try another extraction method. I read some articles that compare the performance of preg_match with strpos; they claim that strpos is up to 20 times more efficient than the regex scanner. I would like to try this approach, but I don't really know how to get started.

Let's say I have this HTML string:

<li id="ncc-nba-16451" class="che10"><a href="/en/star">23 - Star</a></li>
<li id="ncd-bbt-5674" class="che10"><a href="/en/moon">54 - Moon</a></li>
<li id="ertw-cxda-c6543" class="che10"><a href="/en/sun">34,780 - Sun</a></li>

I want to extract only the number from each id and only the text (letters) from the content of the a tags, so I run this preg_match_all scan:

'/<li.*?id=".*?([\d]+)".*?<a.*?>.*?([\w]+)<\/a>/s'

Here you can see the result: LINK

Now, if I wanted to replace my method with strpos, what would the approach look like? I understand that strpos returns the index at which the match starts. But how can I use it to:

get all possible matches, not just one
extract numbers or text from a desired place in the string

Thank you for all the help and tips ;)
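
For reference, the two points above (finding every match and cutting out a piece at a known position) can be sketched with strpos and substr alone. This is only a rough illustration under assumptions: the function names `extractWithStrpos` and `trailingRun` are hypothetical, the markup is assumed to be regular (every item closed with `</li>`, ids in double quotes), and a prefix search for `<li` would also hit tags such as `<link`.

```php
<?php
// Hypothetical helper: returns the run of characters from $charList
// (a trim()-style list, ranges allowed) found at the end of $s.
function trailingRun(string $s, string $charList): string
{
    return substr($s, strlen(rtrim($s, $charList)));
}

// Hypothetical function: scans $html for <li ...>...</li> blocks using
// only strpos/substr and keeps those that have both an id and an <a>.
function extractWithStrpos(string $html): array
{
    $rows = [];
    $offset = 0;
    // Passing an offset to strpos is what yields every match, not just the first.
    while (($liStart = strpos($html, '<li', $offset)) !== false) {
        $liEnd = strpos($html, '</li>', $liStart);
        if ($liEnd === false) {
            break; // unterminated item: stop scanning
        }
        $item = substr($html, $liStart, $liEnd - $liStart);
        $offset = $liEnd + strlen('</li>');

        $idPos = strpos($item, 'id="');
        $aPos = strpos($item, '<a');
        if ($idPos === false || $aPos === false) {
            continue; // skip items without an id or a link
        }
        $idStart = $idPos + strlen('id="');
        $idEnd = strpos($item, '"', $idStart);
        $tagEnd = strpos($item, '>', $aPos);
        $textEnd = $tagEnd === false ? false : strpos($item, '</a>', $tagEnd);
        if ($idEnd === false || $textEnd === false) {
            continue; // malformed markup
        }
        $id = substr($item, $idStart, $idEnd - $idStart);
        $text = substr($item, $tagEnd + 1, $textEnd - $tagEnd - 1);

        $rows[] = [
            'id' => trailingRun($id, '0..9'),          // digits at the end of the id
            'text' => trailingRun($text, 'a..zA..Z'),  // letters at the end of the link text
        ];
    }
    return $rows;
}
```

Whether this is actually faster than a tuned regex on real input is something only a benchmark on that input can show; the sketch trades the flexibility of a pattern for a fixed sequence of substring searches.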

I already wrote that in my question, why duplicate facts? I'm asking whether there is a way to accomplish this with simple non-regex string functions. — Mevia, Aug 19 '15 at 12:21
Why not use a DOMDocument to parse HTML? Then extract the values from the nodes/attributes you need. — Wiktor Stribiżew, Aug 19 '15 at 12:26

score 3 · Answer 1 · answered Aug 19 '15 at 12:29

This regex finds a match in 24 steps with 0 backtracks:

(?:id="[^\d]*(\d*))[^<]*(?:<a href="[^>]*>[^a-z]*([a-z]*))

The regex you posted requires 134 steps. Maybe you will notice a difference? Note that regex engines can optimize a pattern so that it minimizes backtracking. I used the RegexBuddy debugger to arrive at these numbers.
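
A minimal sketch of how the pattern above might be used with preg_match_all on the sample markup from the question. The function name `extractWithRegex`, the `~` delimiter, and the `i` flag are assumptions added here, not part of the pattern as posted; without a case-insensitive flag, `[a-z]*` would capture only the lowercase tail of a capitalised label such as "Star".

```php
<?php
// Hypothetical helper: applies the answer's pattern to a chunk of HTML.
// The "i" flag is an assumption so that [a-z] also covers capital letters.
function extractWithRegex(string $html): array
{
    $pattern = '~(?:id="[^\d]*(\d*))[^<]*(?:<a href="[^>]*>[^a-z]*([a-z]*))~i';
    // PREG_SET_ORDER groups the captures per match instead of per group.
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $rows = [];
    foreach ($matches as $m) {
        $rows[] = ['id' => $m[1], 'text' => $m[2]];
    }
    return $rows;
}
```

The negated classes (`[^\d]`, `[^<]`, `[^>]`) are what keep backtracking low: each one can only consume up to the next delimiter, so the engine never has to retry a lazy `.*?` one character at a time.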

answered Aug 19 '15 at 12:29 by buckley

Unfortunately, this regex finds a lot more data than it should. It is a lot more generic, which is the reason mine was designed the way it was. – Mevia Aug 19 '15 at 15:11
Almost all regexes are a function of the given input. Making a regex useful is a process of discovering and eliminating false positives and false negatives. If your input is not constrained in some way, an HTML parser is the way to go. If there are constraints, the result can be a nice and "simple" regex that does the job. I saw that your regex uses some backtracking (not too much, though), which hinders good performance. – buckley Aug 19 '15 at 18:18
Yes, the problem is that the regex needs to target specifically the `<li>` items that have an `<a>` tag inside, and the data inside also has a specific format. That is why my regex generates backtracks: it tries some `<li>` but doesn't find an `id`, for example, or an `a` inside. This is why I want to switch to a more efficient method. Nonetheless, your regex is of super quality and would serve well if conditions allowed, so thank you for sharing it ;) – Mevia Aug 20 '15 at 07:01

score 3 · Accepted Answer · answered Aug 19 '15 at 12:36

Using DOM

$html = '
<html>
<head></head>
<body>
<li id="ncc-nba-16451" class="che10"><a href="/en/star">23 - Star</a></li>
<li id="ncd-bbt-5674" class="che10"><a href="/en/moon">54 - Moon</a></li>
<li id="ertw-cxda-c6543" class="che10"><a href="/en/sun">34,780 - Sun</a></li>
</body>
</html>';


// Parse the markup into a DOM tree.
$dom_document = new DOMDocument();
$dom_document->loadHTML($html);

$rootElement = $dom_document->documentElement;

// Collect the last dash-separated segment of each <li> id.
$getId = $rootElement->getElementsByTagName('li');
$res = [];
foreach($getId as $tag)
{
   $data = explode('-',$tag->getAttribute('id'));
   $res['li_id'][] = end($data);
}

// Collect the text of each <a> element itself (not of its parent <li>).
$getNode = $rootElement->getElementsByTagName('a');
foreach($getNode as $tag)
{
   $res['a_node'][] = $tag->textContent;
}
print_r($res);

Output:

Array
(
    [li_id] => Array
        (
            [0] => 16451
            [1] => 5674
            [2] => c6543
        )

    [a_node] => Array
        (
            [0] => 23 - Star
            [1] => 54 - Moon
            [2] => 34,780 - Sun
        )

)
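
The output above still contains "c6543" and the full link text, whereas the question asks for digits only and letters only. One possible refinement, sketched under assumptions: the function names `extractWithXPath` and `suffixOf` are hypothetical, the wanted values are assumed to sit at the end of the id and of the link text, and the XPath expression `//li[@id]/a` is one way to express "li items with an id that contain a link".

```php
<?php
// Hypothetical helper: the run of characters from $charList at the end of $s.
function suffixOf(string $s, string $charList): string
{
    return substr($s, strlen(rtrim($s, $charList)));
}

// Hypothetical function: selects <a> elements that are direct children of
// an <li> carrying an id, then trims each value down to digits or letters.
function extractWithXPath(string $html): array
{
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $res = ['li_id' => [], 'a_node' => []];
    foreach ($xpath->query('//li[@id]/a') as $a) {
        // Trailing digits of the parent <li> id, e.g. "c6543" becomes "6543".
        $res['li_id'][] = suffixOf($a->parentNode->getAttribute('id'), '0..9');
        // Trailing letters of the link text, e.g. "23 - Star" becomes "Star".
        $res['a_node'][] = suffixOf(trim($a->textContent), 'a..zA..Z');
    }
    return $res;
}
```

Because both values come from the same `<a>` node and its parent, the two arrays stay aligned even when some list items lack an id or a link, which the two separate getElementsByTagName loops do not guarantee.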
