
I have a huge HTML document to scan. Until now I have been using preg_match_all to extract the desired parts from it. The problem from the start was that it consumed an extreme amount of CPU time, so we finally decided to use some other method for extraction. I read in some articles comparing the performance of preg_match with strpos; they claim that strpos beats the regex scanner by up to 20 times in efficiency. I thought I would try this method, but I don't really know how to get started.

Let's say I have this HTML string:

<li id="ncc-nba-16451" class="che10"><a href="/en/star">23 - Star</a></li>
<li id="ncd-bbt-5674" class="che10"><a href="/en/moon">54 - Moon</a></li>
<li id="ertw-cxda-c6543" class="che10"><a href="/en/sun">34,780 - Sun</a></li>

I want to extract only the number from each id and only the text (letters) from the content of the a tags, so I do this preg_match_all scan:

'/<li.*?id=".*?([\d]+)".*?<a.*?>.*?([\w]+)<\/a>/s'

Here you can see the result: LINK

Now, if I wanted to replace my method with strpos, what would the approach look like? I understand that strpos returns the index at which the match starts. But how can I use it to:

  • get all possible matches, not just one
  • extract numbers or text from desired place in string
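
As a rough illustration of both points, a strpos-based scan could be sketched like this. It assumes the markup always has the shape shown above (an `<li id="...">` followed by a single `<a ...>` tag), and the helper names `between`, `trailingRun` and `extractItems` are hypothetical, not part of PHP. Whether it is actually faster than the regex would have to be measured on the real data.

```php
<?php
// Hypothetical helper: text between $start and $end, searching from $offset.
// Moves $offset past $end on success; returns null when a marker is missing.
function between(string $html, string $start, string $end, int &$offset): ?string
{
    $from = strpos($html, $start, $offset);
    if ($from === false) {
        return null;
    }
    $from += strlen($start);
    $to = strpos($html, $end, $from);
    if ($to === false) {
        return null;
    }
    $offset = $to + strlen($end);
    return substr($html, $from, $to - $from);
}

// Hypothetical helper: the trailing run of characters listed in $chars,
// e.g. '0..9' for digits (rtrim accepts ".." ranges in its character list).
function trailingRun(string $s, string $chars): string
{
    return substr($s, strlen(rtrim($s, $chars)));
}

function extractItems(string $html): array
{
    $items = [];
    $offset = 0;
    // Restarting each search at the previous offset yields every match,
    // not just the first one.
    while (($id = between($html, '<li id="', '"', $offset)) !== null) {
        $anchor = between($html, '<a ', '</a>', $offset);
        if ($anchor === null) {
            break;
        }
        // Drop the rest of the opening tag, keeping only the link content.
        $gt = strpos($anchor, '>');
        $content = $gt === false ? '' : substr($anchor, $gt + 1);
        $items[] = [
            'id'   => trailingRun($id, '0..9'),
            'text' => trailingRun($content, 'a..zA..Z'),
        ];
    }
    return $items;
}
```

Passing the offset by reference keeps the position bookkeeping in one place, so the loop reads as "find the next id, then the next link after it".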

Thank you for all the help and tips ;)

Mevia
  • I already wrote that in my question, why duplicate facts? I'm asking if there is a way to accomplish it with simple non-regex string functions. – Mevia Aug 19 '15 at 12:21
  • Why not use a DOMDocument to parse HTML? Then extract the values from the nodes/attributes you need. – Wiktor Stribiżew Aug 19 '15 at 12:26
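
A minimal sketch of the DOMDocument route suggested in the comment above, assuming the DOM extension is available and that the `<li>` fragments can be wrapped in a `<ul>` for parsing. The function name `extractWithDom` is hypothetical, and the trailing-digits/trailing-letters trimming is only one way to get "numbers only" and "letters only" out of the values.

```php
<?php
function extractWithDom(string $html): array
{
    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    // Assumption: the input is a bare list fragment, so wrap it in <ul>.
    $doc->loadHTML('<ul>' . $html . '</ul>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $items = [];
    foreach ($doc->getElementsByTagName('li') as $li) {
        $a = $li->getElementsByTagName('a')->item(0);
        if ($a === null) {
            continue; // list item without a link
        }
        $id = $li->getAttribute('id');
        $text = trim($a->textContent);
        $items[] = [
            // Keep only the trailing digits of the id.
            'id'   => substr($id, strlen(rtrim($id, '0..9'))),
            // Keep only the trailing letters of the link text.
            'text' => substr($text, strlen(rtrim($text, 'a..zA..Z'))),
        ];
    }
    return $items;
}
```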

2 Answers


This regex finds a match in 24 steps using 0 backtracks:

(?:id="[^\d]*(\d*))[^<]*(?:<a href="[^>]*>[^a-z]*([a-z]*))

The regex you posted requires 134 steps. Maybe you will notice a difference? Note that regex engines can apply optimizations that minimize backtracking. I used the RegexBuddy debugger to arrive at these numbers.
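
For reference, here is a small sketch of how the pattern above could be plugged into preg_match_all. The `~` delimiters and the `i` modifier are my assumptions: without `i`, `[^a-z]*` would also swallow the capital first letter, so the second group would capture "tar" instead of "Star". The function name `extractWithRegex` is hypothetical.

```php
<?php
function extractWithRegex(string $html): array
{
    // Same pattern as above; delimiters and the "i" modifier are assumptions.
    $pattern = '~(?:id="[^\d]*(\d*))[^<]*(?:<a href="[^>]*>[^a-z]*([a-z]*))~i';

    $items = [];
    // PREG_SET_ORDER groups the captures per match: [full, id digits, letters].
    if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            $items[] = ['id' => $m[1], 'text' => $m[2]];
        }
    }
    return $items;
}
```

Note that `[^\d]*` can run past the closing quote of the id, so an id without digits would pick up digits from a later attribute; the pattern relies on every id containing a number, as in the sample.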

buckley