Here is a sample of code I have extracted from a webpage...

<div class="user-details-narrow">
            <div class="profileheadtitle">
                <span class=" headline txtBlue size15">
                    Profession
                </span>
            </div>
            <div class="profileheadcontent-narrow">
                <span class="txtGrey size15">
                    administration
                </span>
            </div>
        </div>

<div class="user-details-narrow">
            <div class="profileheadtitle">
                <span class=" headline txtBlue size15">
                    Industry
                </span>
            </div>
            <div class="profileheadcontent-narrow">
                <span class="txtGrey size15">
                    banking
                </span>
            </div>
        </div>

What I want to achieve is to extract the data within those DIVs. For example...

Profession = administration, Industry = banking

Currently I am pulling the webpage with cURL, then stripping out the HTML tags, and using hundreds of preg_match calls and if statements. While the solution works very well, it does use a lot of CPU and RAM.

It has been suggested I use DOMDocument instead but I can't seem to get anything to work, mostly due to lack of knowledge.

Can someone give me an idea how to extract this data?

  • You should definitely use DOMDocument, as you can then pull out the data that you need. Have you got an example of the PHP side, with your cURL request, to update your post with? – Danny Broadbent Jul 01 '15 at 15:20
  • Thanks for that, I'll give it a go... –  Jul 01 '15 at 15:24
  • @AndyUK: Ignore the comment, there's a mistake in there (`DOMDocument::xpath` method doesn't exist), I've posted an answer showing the correct way to use xpath to query the DOM – Elias Van Ootegem Jul 01 '15 at 15:29
  • possible duplicate of [How do I extract keyword from webpage using PHP DOM](http://stackoverflow.com/questions/30954037/how-do-i-extract-keyword-from-webpage-using-php-dom) – chris85 Jul 02 '15 at 15:34

1 Answer

Posting my comment from earlier as a possible answer, with some explanation as to why I think this is how you could solve the problem:

$dom = new DOMDocument;
$dom->loadHTML($theHtmlString);
//get all profileheadtitle nodes
//they seem to contain the first bits of info you're after
$xpath = new DOMXpath($dom);
$titles = $xpath->query('//*[@class="profileheadtitle"]');
//let's iterate over them, using the `textContent` property to get the value
foreach ($titles as $div)
{
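    //each node also has a second div right next to it
    //it's on the same level and we need its content, too
    //DOMNode::$nextSibling is usually a whitespace text node here,
    //so ask xpath for the first following sibling element instead
    $content = $xpath->query('following-sibling::*[1]', $div)->item(0);
    echo trim($div->textContent) . ' = ' . ($content ? trim($content->textContent) : '') . PHP_EOL;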
}

Job done. Do check the DOMNode class docs for details, and you might want to read up on the DOMXPath class, too.
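
As a fuller sketch, assuming the markup from the question, the title/content pairs could be collected into an array instead of echoed. The names `$html` and `$details` are purely illustrative, and `following-sibling::div[1]` is used so that whitespace text nodes between the two divs are skipped:

```php
<?php
// Sample markup copied from the question; in practice this would be the cURL response body.
$html = <<<'HTML'
<div class="user-details-narrow">
    <div class="profileheadtitle">
        <span class=" headline txtBlue size15">
            Profession
        </span>
    </div>
    <div class="profileheadcontent-narrow">
        <span class="txtGrey size15">
            administration
        </span>
    </div>
</div>
<div class="user-details-narrow">
    <div class="profileheadtitle">
        <span class=" headline txtBlue size15">
            Industry
        </span>
    </div>
    <div class="profileheadcontent-narrow">
        <span class="txtGrey size15">
            banking
        </span>
    </div>
</div>
HTML;

// Real-world pages are rarely valid HTML, so collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$details = [];
foreach ($xpath->query('//div[@class="profileheadtitle"]') as $title) {
    // First <div> sibling after the title div, evaluated relative to that title node.
    $value = $xpath->query('following-sibling::div[1]', $title)->item(0);
    if ($value !== null) {
        // textContent includes the indentation around the text, hence trim().
        $details[trim($title->textContent)] = trim($value->textContent);
    }
}

// $details now maps 'Profession' to 'administration' and 'Industry' to 'banking'.
print_r($details);
```

Keying the array by the title text keeps the lookup simple (`$details['Profession']`), at the cost of overwriting earlier entries if a page repeats a title.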

Note that this bit: $xpath->query('//*[@class="profileheadtitle"]'); queries the DOM for all nodes that have the profileheadtitle class. If you want to restrict the nodes to just the <div> elements that have this class, then you can write this:

$xpath->query('//div[@class="profileheadtitle"]');

It's also important to understand that, though effective, this xpath expression will not work if some (or all) of the divs have multiple classes. It only returns the nodes whose class attribute is exactly profileheadtitle and nothing else. The more academically correct way would be to write this:

$xpath->query(
    '//div[contains(concat(" ", normalize-space(@class), " "), concat(" ", "profileheadtitle", " "))]'
);

This will be able to handle nodes that carry several classes, like:

<div id="bar" class="foo profileheadtitle mark-red" style="border: 1px solid black;"></div>
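
To see the difference between the expressions, here is a small sketch run against three made-up divs (the markup and variable names are assumptions for illustration only). It also includes a plain `contains(@class, ...)` test for comparison, which matches too much because it is a substring check:

```php
<?php
// Three hypothetical divs: an exact match, a multi-class match, and a look-alike class.
$html = '<div class="profileheadtitle">one</div>'
      . '<div id="bar" class="foo profileheadtitle mark-red">two</div>'
      . '<div class="profileheadtitle-narrow">three</div>';

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Exact attribute comparison: only the div whose class attribute is exactly that string.
$exact = $xpath->query('//div[@class="profileheadtitle"]');

// Space-padded token test: the class may appear anywhere in the list, but only as a whole word.
$token = $xpath->query(
    '//div[contains(concat(" ", normalize-space(@class), " "), " profileheadtitle ")]'
);

// Plain substring test: also matches "profileheadtitle-narrow".
$naive = $xpath->query('//div[contains(@class, "profileheadtitle")]');

echo $exact->length, ' ', $token->length, ' ', $naive->length, PHP_EOL; // prints "1 2 3"
```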
Elias Van Ootegem
  • Fantastic, I'll let you know how I get on. Thanks for that !! –  Jul 01 '15 at 15:42
  • About the class problem, one solution is to register a PHP function (for example one called `hasClass`) instead of using an equality test or `contains`: http://php.net/manual/en/domxpath.registerphpfunctions.php – Casimir et Hippolyte Jul 01 '15 at 16:01
  • It appears it is only listing the questions (profession, industry), not the answers. –  Jul 01 '15 at 17:00
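
The `hasClass` idea from the comments could look roughly like the sketch below. The helper name comes from the comment, but its body, its signature and the sample markup are assumptions, not code from the thread. `DOMXPath::registerPhpFunctions()` exposes the function to xpath through the `php` namespace:

```php
<?php
// Hypothetical helper: true when $class appears as a whole token in the class attribute string.
function hasClass($attr, $class)
{
    $tokens = preg_split('/\s+/', trim($attr), -1, PREG_SPLIT_NO_EMPTY);
    return in_array($class, $tokens, true);
}

// Made-up markup: one multi-class div that should match, one look-alike that should not.
$dom = new DOMDocument;
$dom->loadHTML(
    '<div class="foo profileheadtitle mark-red">Profession</div>'
    . '<div class="profileheadtitle-narrow">administration</div>'
);

$xpath = new DOMXPath($dom);
// The php: prefix has to be bound to this namespace before php:function() can be used.
$xpath->registerNamespace('php', 'http://php.net/xpath');
// Only the functions listed here may be called from xpath expressions.
$xpath->registerPhpFunctions('hasClass');

// string(@class) passes the attribute value as a plain string rather than a list of nodes.
$nodes = $xpath->query('//div[php:function("hasClass", string(@class), "profileheadtitle")]');

foreach ($nodes as $node) {
    echo trim($node->textContent), PHP_EOL;
}
```

This reads more easily than the `concat`/`normalize-space` expression, though it calls back into PHP for every candidate node, so it may be slower on large documents.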