
Given the following HTML code snippet:

<div class="item">
  large
  <span class="some-class">size</span>
</div>

I'm looking for the best way to extract the string "large" using Symfony's Crawler.

$crawler = new Crawler($html);

Here I could use $crawler->html() then apply a regex search. Is there a better solution? Or how would you do it exactly?

haxpanel

3 Answers


I've just found a solution that looks the cleanest to me:

$crawler = new Crawler($html);
$result = $crawler->filterXPath('//text()')->text();
haxpanel
  • `$result = $crawler->filterXPath('//div[@class="item"]/text()')->text();` would be better. – COil Nov 18 '15 at 15:38
  • I think we don't actually need this extra selector: the div.item node has already been selected because that's the root node. – haxpanel Nov 18 '15 at 19:10
  • But you will never have to handle just this HTML snippet; I suppose this would be used when retrieving a full HTML source. – COil Nov 19 '15 at 08:21
  • I'm using CSS selectors, so I'm forced to use XPath only at the end, somehow like this: $crawler->filter('div.item')->filterXPath('//text()')->text(); – haxpanel Nov 19 '15 at 08:45
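
To illustrate the scoping point raised in these comments, here is a minimal sketch that uses PHP's built-in DOMDocument and DOMXPath instead of the Crawler, so it runs without Composer. The surrounding page with the h1 heading is a hypothetical example, not part of the question, and the `[normalize-space()]` predicate is an addition that skips whitespace-only text nodes.

```php
<?php
// Hypothetical full page: the target div is no longer the first node with text.
$html = <<<'HTML'
<html><body>
<h1>Products</h1>
<div class="item">
  large
  <span class="some-class">size</span>
</div>
</body></html>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Unscoped: the first non-blank text node anywhere in the document.
$unscoped = trim($xpath->query('//text()[normalize-space()]')->item(0)->nodeValue);

// Scoped: the first non-blank text node that is a direct child of div.item.
$scoped = trim($xpath->query('//div[@class="item"]/text()[normalize-space()]')->item(0)->nodeValue);

echo $unscoped, "\n"; // Products
echo $scoped, "\n";   // large
```

On the bare snippet from the question both expressions would find "large"; they only diverge once other text precedes the div.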
$crawler = new Crawler($html);
$node = $crawler->filterXPath('//div[@class="item"]');
$domElement = $node->getNode(0);
foreach ($node->children() as $child) {
    $domElement->removeChild($child);
}
dump($node->text()); die();

Afterwards, you have to trim the whitespace.
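
A variant of the same idea that leaves the original document untouched, sketched with PHP's built-in DOM classes rather than the Crawler (an assumption made so that the example is self-contained): clone the element, strip the child elements from the copy, then trim what is left.

```php
<?php
$html = <<<'HTML'
<div class="item">
  large
  <span class="some-class">size</span>
</div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$div = (new DOMXPath($dom))->query('//div[@class="item"]')->item(0);

// Deep copy, so the removals below do not alter the parsed document.
$copy = $div->cloneNode(true);

// childNodes is a live list; take a snapshot before removing anything.
foreach (iterator_to_array($copy->childNodes) as $child) {
    if ($child instanceof DOMElement) {
        $copy->removeChild($child);
    }
}

// Only the text nodes remain; trim the surrounding whitespace.
$result = trim($copy->textContent);

echo $result, "\n"; // large
```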

COil

This is a bit tricky, as the text that you're trying to get is a text node, which the DomCrawler component doesn't (as far as I know) allow you to extract directly. Thankfully, DomCrawler is just a layer over the top of PHP's DOM classes, which means you could probably do something like:

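use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler($html);
$crawler = $crawler->filterXPath('//div[@class="item"]');
$domNode = $crawler->getNode(0); // the underlying \DOMElement
$text = null;

// DOMNode exposes its children as childNodes (there is no "children" property)
foreach ($domNode->childNodes as $domChild) {
    if ($domChild instanceof \DOMText) {
        $text = $domChild->wholeText; // text of the first text node, whitespace included
        break;
    }
}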

This wouldn't help with HTML like:

<div>
    text
    <span>hello</span>
    other text
</div>

So you would only get "text", not "text other text" in this instance. Take a look at the DOMText documentation for more details.
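
If you did want "text other text" in that case, one option is to collect every direct text node instead of stopping at the first. A minimal sketch, assuming PHP's built-in DOMDocument in place of the Crawler so that it is self-contained; joining the pieces with a single space is my own choice here:

```php
<?php
$html = <<<'HTML'
<div>
    text
    <span>hello</span>
    other text
</div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$div = $dom->getElementsByTagName('div')->item(0);

$parts = [];
foreach ($div->childNodes as $child) {
    // Keep direct text nodes only; the span's own text is skipped.
    if ($child instanceof DOMText) {
        $piece = trim($child->nodeValue);
        if ($piece !== '') {
            $parts[] = $piece;
        }
    }
}

$text = implode(' ', $parts);

echo $text, "\n"; // text other text
```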

John Noel