Load time: is it quicker to parse HTML with PHP's DOMDocument or with Regular Expressions?

Question

I'm pulling images from my Flickr account to my website, and I had used about nine lines of code to create a preg_match_all function that would pull the images.

I've read several times that it is better to parse HTML through DOM.

Personally, I've found it more complicated to parse HTML through DOM. I wrote a similar function to pull the images with PHP's DOMDocument, and it's about 22 lines of code. It took a while to write, and I'm not sure what the benefit was.

The page loads in about the same time with either version, so I'm not sure why I would use DOMDocument.

Does DOMDocument work faster than preg_match_all?

I'll show you my code, if you're interested (you can see how lengthy the DOMDocument code is):

//here's the URL
$flickrGallery = 'http://www.flickr.com/photos/***/collections/***/';

//below is the DOMDocument method
$flickr = new DOMDocument();
$flickr->validateOnParse = true;
$flickr->loadHTMLFile($flickrGallery);
$elements = $flickr->getElementById('ViewCollection')->getElementsByTagName('div');
$flickr = array();
for($i=0;$i<$elements->length;$i++){
    if($elements->item($i)->hasAttribute('class')&&$elements->item($i)->getAttribute('class')=='setLinkDiv'){
        $flickr[] = array(
                          'href' => $elements->item($i)->getElementsByTagName('a')->item(0)->getAttribute('href'), 
                          'src' => $elements->item($i)->getElementsByTagName('img')->item(0)->getAttribute('src'), 
                          'title' => $elements->item($i)->getElementsByTagName('img')->item(0)->getAttribute('alt')
                          );
    }
}
$elements = NULL;
foreach($flickr as $k=>$v){
    $setQuery = explode("/",$flickr[$k]['href']);
    $setQuery = $setQuery[4];
    echo '<a href="?set='.$setQuery.'"><img src="'.$flickr[$k]['src'].'" title="'.$flickr[$k]['title'].'" width=75 height=75 /></a>';
}
$flickr = NULL;

//preg_match_all code is below

$sets = file_get_contents($flickrGallery);
preg_match_all('/(class="setLink" href="(.*?)".*?class="setThumb" src="(.*?)".*?alt="(.*?)")+/s',$sets,$sets,PREG_SET_ORDER);
foreach($sets as $k=>$v){
    $setQuery = explode("/",$sets[$k][2]);
    $setQuery = $setQuery[4];
    echo '<a href="?set='.$setQuery.'"><img src="'.$sets[$k][3].'" title="'.$sets[$k][4].'" width=75 height=75 /></a>';
}
$sets = NULL;

Why are you asking us? You got the code, so use a profiler and benchmark it. — Gordon, Aug 15 '11 at 16:07
Benchmark benchmark benchmark. Regex will probably be a bit faster for simple patterns, but DOM will be far far more reliable. But, that all depends on just how complex your regex is, and how complicated the dom tree is. Only YOU can figure out which is better/faster overall. — Marc B, Aug 15 '11 at 16:10
You can probably cut down a lot of that DOM code by using XPath. — Gordon, Aug 15 '11 at 16:16
Why are you not just using the API instead of doing a screen scrape??????? — prodigitalson, Aug 16 '11 at 03:51
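
Following up on two of the suggestions in the comments above (benchmark it, and shorten the DOM code with XPath), here is a rough sketch that does both against a local string instead of the live page. The sample markup is an assumption inferred from the class names in the question, not Flickr's actual HTML, and the helper names extractWithXPath, extractWithRegex and timeIt are made up for illustration. The regex is a simplified form of the one in the question (the outer repeated group is dropped).

```php
<?php
// Hypothetical sample markup, inferred from the class names in the question.
// The real Flickr page may well be structured differently.
$html = <<<'HTML'
<div id="ViewCollection">
  <div class="setLinkDiv">
    <a class="setLink" href="/photos/someuser/sets/111/"><img class="setThumb" src="a.jpg" alt="First set"></a>
  </div>
  <div class="setLinkDiv">
    <a class="setLink" href="/photos/someuser/sets/222/"><img class="setThumb" src="b.jpg" alt="Second set"></a>
  </div>
</div>
HTML;

// DOM + XPath version: one query replaces the manual loop over every div.
function extractWithXPath(string $html): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Match divs whose class list contains "setLinkDiv", even alongside other classes.
    $query = '//*[@id="ViewCollection"]'
           . '//div[contains(concat(" ", normalize-space(@class), " "), " setLinkDiv ")]';

    $sets = [];
    foreach ($xpath->query($query) as $div) {
        $a   = $xpath->query('.//a', $div)->item(0);
        $img = $xpath->query('.//img', $div)->item(0);
        if ($a === null || $img === null) {
            continue; // skip divs that lack a link or a thumbnail
        }
        $sets[] = [
            'href'  => $a->getAttribute('href'),
            'src'   => $img->getAttribute('src'),
            'title' => $img->getAttribute('alt'),
        ];
    }
    return $sets;
}

// Regex version, simplified from the pattern in the question.
function extractWithRegex(string $html): array
{
    $pattern = '/class="setLink" href="(.*?)".*?class="setThumb" src="(.*?)".*?alt="(.*?)"/s';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $sets = [];
    foreach ($matches as $m) {
        $sets[] = ['href' => $m[1], 'src' => $m[2], 'title' => $m[3]];
    }
    return $sets;
}

// Run a function repeatedly and return the elapsed wall-clock seconds.
function timeIt(callable $fn, string $html, int $runs): float
{
    $start = microtime(true);
    for ($i = 0; $i < $runs; $i++) {
        $fn($html);
    }
    return microtime(true) - $start;
}

$xpathSets = extractWithXPath($html);
$regexSets = extractWithRegex($html);

$runs = 1000;
printf("XPath: %.4fs for %d runs\n", timeIt('extractWithXPath', $html, $runs), $runs);
printf("Regex: %.4fs for %d runs\n", timeIt('extractWithRegex', $html, $runs), $runs);
```

Timings will vary with the machine and the input, so treat the numbers as a starting point only. Both versions in the question still have to download the page before parsing it, so the network fetch is likely to dominate the total load time, which may be why the two felt equally fast.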

Andy Lester · Accepted Answer · 2013-01-03T20:31:32.113

If you're willing to sacrifice correctness for speed, then go ahead and try to roll your own parser with regular expressions.

You say "Personally, I've found it more complicated to parse HTML through DOM." Are you optimizing for correctness of results, or how easy it is for you to write the code?

If all you want is speed and code that's not complicated, why not just use this:

$array_of_photos = Array( 'booger.jpg', 'aunt-martha-on-a-horse.png' );

or maybe just

$array_of_photos = Array();

Those run in constant time, and they're easy to understand. No problem, right?

What's that? You want accurate results? Then don't parse HTML with regular expressions.

Finally, when you're working with a parser like DOM, you're working with a piece of code that has been well-tested and debugged for years. When you're writing your own regular expressions to do the parsing, you're working with code that you're going to have to write, test and debug yourself. Why would you not want to work with the tools that many people have been using for many years? Do you think you can do a better job yourself on the fly?

Thanks Andy. This question was based on a site for a client that ended up being scrapped. I agree that DOM parsing is better than regex on principle. — bozdoz, Dec 28 '12 at 21:00

score 2 · Answer 2 · answered Aug 15 '11 at 16:08

I would use DOM as this is less likely to break if any small changes are made to the page.

Ed Heal

Wouldn't my DOM code break just as easily with changes to the external page? – bozdoz Aug 15 '11 at 16:26
Not for all changes... Your regex could break, for example, if the whitespace in the tag changes slightly, if a class name is added to the `class` attribute, or if the order of the attributes on the tag is changed. Using the DOM and XPath can protect you from nearly all of that. – prodigitalson Aug 16 '11 at 03:55
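
To make that last comment concrete, here is a small sketch comparing the two approaches on two snippets that a browser would render the same way. Both snippets are hypothetical markup based on the class names in the question, and countWithRegex and countWithXPath are illustrative helper names. The second snippet swaps the attribute order and adds an extra class, which are exactly the kinds of change described above.

```php
<?php
// Hypothetical markup as the question's regex expects it.
$original = '<div class="setLinkDiv"><a class="setLink" href="/photos/someuser/sets/111/">'
          . '<img class="setThumb" src="a.jpg" alt="First set"></a></div>';

// Same content, but attributes are reordered and the div carries an extra class.
$reordered = '<div class="setLinkDiv wide"><a href="/photos/someuser/sets/111/" class="setLink">'
           . '<img src="a.jpg" class="setThumb" alt="First set"></a></div>';

// Count matches using a simplified form of the regex from the question.
function countWithRegex(string $html): int
{
    $pattern = '/class="setLink" href="(.*?)".*?class="setThumb" src="(.*?)".*?alt="(.*?)"/s';
    return (int) preg_match_all($pattern, $html);
}

// Count thumbnails using DOM + XPath, which ignores attribute order.
function countWithXPath(string $html): int
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Token-style class match, so "setLinkDiv wide" still qualifies.
    $query = '//div[contains(concat(" ", normalize-space(@class), " "), " setLinkDiv ")]'
           . '//a[@href]//img[@src]';
    return $xpath->query($query)->length;
}

printf("original:  regex=%d xpath=%d\n", countWithRegex($original), countWithXPath($original));
printf("reordered: regex=%d xpath=%d\n", countWithRegex($reordered), countWithXPath($reordered));
```

The regex finds the set in the first snippet but not in the second, while the XPath query finds it in both. A larger restructuring of the page (renamed classes, a different nesting) would still break either approach.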
