I am trying to scrape the quotes from a given Wikiquote page using the PHP package Goutte, which wraps the Symfony components BrowserKit, CssSelector and DomCrawler.
However, there are certain quotes that I do not want in my result set: the quotes from the misattributed section.
Here is what I have so far:
use Goutte\Client;
$client = new Client();
$crawler = $client->request('GET', 'http://en.wikiquote.org/wiki/Thomas_Jefferson');
//grab all the children li's from the wikiquote page
$quotes = $crawler->filter('ul > li');
$quoteArray = [];
//foreach li with a node value that does not start with a number, push the node value onto quote array
//this filters out the table of contents <li> node values which I do not want
foreach ($quotes as $quote) {
    if (!is_numeric(substr($quote->nodeValue, 0, 1))) {
        $quoteArray[] = $quote->nodeValue;
    }
}
The problem that I am focusing on at this point is how to filter out the quotes from the misattributed section. This section is contained in a parent div which has the style attribute:
style="padding: .5em; border: 1px solid black; background-color:#FFE7CC"
I was thinking that if I can somehow grab the li node values from this specific section, I can then filter them out of my $quoteArray above. The issue I am having is that I cannot figure out how to select the child li node values from this section.
I have tried selecting the children with variations of the following:
$badQuotes = $crawler->filter('div[style="padding: .5em; border: 1px solid black; background-color:#FFE7CC"] > ul > li');
But this is not returning the node values that I need. Does anyone know how to do this or what I am doing wrong?
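For reference, the filtering idea described above (collect the li values from the styled div, then remove them from the full list) can be sketched with PHP's built-in DOM extension rather than Goutte. This is only a minimal sketch under stated assumptions: the HTML fragment is a hypothetical stand-in for the real page, `nodeTexts` is a made-up helper name, and matching on the substring `#FFE7CC` instead of the full style string is an assumption, chosen because the exact whitespace of the attribute on the live page may differ from what is written in the selector.

```php
<?php
// Hypothetical fragment standing in for the real Wikiquote markup.
$html = <<<HTML
<div>
  <ul><li>Genuine quote one.</li><li>Genuine quote two.</li></ul>
  <div style="padding: .5em; border: 1px solid black; background-color:#FFE7CC">
    <ul><li>Misattributed quote.</li></ul>
  </div>
</div>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Hypothetical helper: trimmed text of every node matched by an XPath query.
function nodeTexts(DOMXPath $xpath, string $query): array
{
    $texts = [];
    foreach ($xpath->query($query) as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

// Every li that is a direct child of a ul, wherever it sits in the page.
$allQuotes = nodeTexts($xpath, '//ul/li');

// Only the li nodes nested (at any depth) inside the styled div.
// Assumption: a substring match on the background colour identifies that div.
$badQuotes = nodeTexts($xpath, '//div[contains(@style, "#FFE7CC")]//li');

// Remove the unwanted values and reindex the result.
$quoteArray = array_values(array_diff($allQuotes, $badQuotes));
print_r($quoteArray);
```

Note that `array_diff` compares by string value, so a genuine quote whose text is identical to a misattributed one would also be dropped. If the same XPath expressions are wanted on the crawler itself, DomCrawler's `filterXPath()` method accepts XPath queries, though whether the expressions above match the live page is untested here.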