I am using Goutte (https://github.com/FriendsOfPHP/Goutte) to parse and extract data, and so far it has been working well.
But now I have stumbled upon a slightly unfriendly spot:
<tr>
<th>Website:</th>
<td>
<a href="http://www.adres.com" target="_blank">http://www.adres.com</a>
</td>
</tr>
I am trying to get the text from a td element which immediately follows a th element containing a specific string, "Website:" in this case.
My PHP looks like this:
$client3 = new \Goutte\Client();
$crawler3 = $client3->request('GET', $supplierurl . 'contactinfo.html');
if ($crawler3->filter('th:contains("+Website+") + td a')->count() > 0) {
    $parsed_company_website_url = $crawler3->filter('th:contains("Website:") + td')->text();
} else {
    $parsed_company_website_url = null;
}
return $parsed_company_website_url;
Problem
My code doesn't work.
My attempts
- I tried using both "+Website+" and "Website:" as the string inside :contains().
- I tried to do some smart targeting by counting rows of the table, but each DB entry on the target site arranges items differently, so there is no reliable pattern.
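One direction that might avoid both problems above is to match on the header text with XPath instead of CSS or row counting. The following is only a minimal sketch under stated assumptions: it uses PHP's built-in DOM extension rather than Goutte, the helper name extractCellAfterHeader is hypothetical, and the sample row is wrapped in a table element purely so the snippet parses on its own. The same XPath expression could presumably be passed to the crawler's filterXPath() method, though that is not verified here.

```php
<?php
declare(strict_types=1);

// Hypothetical helper (not part of Goutte): returns the trimmed text of the
// <td> that immediately follows a <th> containing $label, or null if absent.
// Assumption: $label contains no double quotes, since it is placed inside a
// double-quoted XPath string literal.
function extractCellAfterHeader(string $html, string $label): ?string
{
    // Collect parser warnings internally instead of emitting them.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // following-sibling::*[1][self::td] mirrors the CSS "th + td": the very
    // next sibling element, and only if that element is a <td>.
    $query = sprintf(
        '//th[contains(normalize-space(.), "%s")]/following-sibling::*[1][self::td]',
        $label
    );
    $cells = $xpath->query($query);
    if ($cells === false || $cells->length === 0) {
        return null;
    }
    return trim($cells->item(0)->textContent);
}

// Sample markup taken from the row shown earlier, wrapped in <table>.
$html = <<<'HTML'
<table>
<tr>
<th>Website:</th>
<td>
<a href="http://www.adres.com" target="_blank">http://www.adres.com</a>
</td>
</tr>
</table>
HTML;

var_dump(extractCellAfterHeader($html, 'Website:'));
```

Because the match is made on the header text, the position of the row within the table should not matter, which is the part that row counting could not handle.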
TO DO
Make the script extract the text from a