I am trying to scrape the quotes from a given wikiquote page using the PHP package Goutte, which wraps the Symfony components BrowserKit, CssSelector and DomCrawler.

However, there are certain quotes that I do not want in my result set: the quotes from the misattributed section.

Here is what I have so far:

use Goutte\Client;

$client = new Client();

$crawler = $client->request('GET', 'http://en.wikiquote.org/wiki/Thomas_Jefferson');

//grab all the children li's from the wikiquote page
$quotes = $crawler->filter('ul > li');

$quoteArray = [];

// Push the node value of each li onto the quote array, unless it starts with a number.
// This filters out the table of contents <li> node values, which I do not want.
foreach ($quotes as $quote) {
    if (!is_numeric(substr($quote->nodeValue, 0, 1))) {
        $quoteArray[] = $quote->nodeValue;
    }
}

The problem that I am focusing on at this point is how to filter out the quotes from the misattributed section. This section is contained in a parent div which has the style attribute:

style="padding: .5em; border: 1px solid black; background-color:#FFE7CC"

I was thinking that if I can somehow grab the li node values from this specific section I can then filter them out from my above $quoteArray. The issue I am having is that I cannot figure out how to select the children li node values from this section.

I have tried selecting the children with variations of the following:

$badQuotes = $crawler->filter('div[style="padding: .5em; border: 1px solid black; background-color:#FFE7CC"] > ul > li');

But this is not returning the node values that I need. Does anyone know how to do this or what I am doing wrong?

Fetus

1 Answer

The DomCrawler filter method "filters the list of nodes with a CSS selector", which is less powerful than using XPath. My guess is that the CssSelector component could not convert your complex query into an XPath expression. A complex filter like this is better done with the filterXPath method instead, which "filters the list of nodes with an XPath expression".

So, in your case, try using the filterXPath method:

$crawler->filterXPath("//div[contains(@style,'padding: .5em; border: 1px solid black; background-color:#FFE7CC')]");
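The expression above selects the div itself rather than the li nodes inside it. As a rough sketch of how the exclusion could be done in a single XPath query, the example below uses PHP's built-in DOMDocument and DOMXPath on a small, made-up HTML fragment (the markup is an assumption standing in for the real wikiquote page). It matches the div on the background-color part of its style attribute only, on the assumption that this substring is enough to identify the misattributed box.

```php
<?php
// Hypothetical, simplified stand-in for the wikiquote page markup.
$html = <<<'HTML'
<html><body>
<div id="toc"><ul><li>1 Quotes</li><li>2 Misattributed</li></ul></div>
<ul><li>Genuine quote A</li></ul>
<ul><li>Genuine quote B</li></ul>
<div style="padding: .5em; border: 1px solid black; background-color:#FFE7CC">
<ul><li>Misattributed quote</li></ul>
</div>
</body></html>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Select every li that is a child of a ul, skipping any li that has
// the styled "misattributed" div somewhere among its ancestors.
$expr = "//ul/li[not(ancestor::div[contains(@style, 'background-color:#FFE7CC')])]";

$quoteArray = [];
foreach ($xpath->query($expr) as $li) {
    $text = trim($li->nodeValue);
    // Same check as in the question: drop table of contents entries,
    // which start with a number.
    if (!is_numeric(substr($text, 0, 1))) {
        $quoteArray[] = $text;
    }
}

print_r($quoteArray);
```

With Goutte, the same expression should be usable as `$crawler->filterXPath($expr)`, though that has not been checked against the live page, whose style attribute may differ in whitespace from the string in the question.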
ihsan