Parsing with Goutte - how to target an element after one containing a text string

Question

I am using https://github.com/FriendsOfPHP/Goutte to parse and extract data and I am doing well...

But now I stumbled upon a slightly unfriendly spot:

<tr>
<th>Website:</th>
<td>
    <a href="http://www.adres.com" target="_blank">http://www.adres.com</a>
</td>
</tr>

I am trying to get the text from a td element that immediately follows a th element containing a specific string ("Website:" in this case).

My PHP looks like this:

$client3 = new \Goutte\Client();
$crawler3 = $client3->request('GET', $supplierurl . 'contactinfo.html');

if($crawler3->filter('th:contains("+Website+") + td a')->count() > 0) {
    $parsed_company_website_url = $crawler3->filter('th:contains("Website:") + td')->text();
} else {
    $parsed_company_website_url = null;
}
return $parsed_company_website_url;

Problem

My code doesn't work.

My attempts

I tried using both "+Website+" and "Website:" as the search string.
I tried to do some smart targeting by counting the rows of the table, but each DB entry on the target site arranges its items differently, so there is no reliable pattern.

TO DO

Make the script extract the text from a

score 0 · Answer 1 · answered Aug 21 '17 at 12:16

It seems that contains() is a jQuery feature and not a standard CSS selector. With CSS, you can match on an attribute value but not on the text node inside an element.

So, in your case, I would use an XPath selector, specifically the following-sibling axis (see https://stackoverflow.com/a/29380551/1997849)
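A minimal sketch of the following-sibling idea, using PHP's built-in DOMDocument and DOMXPath so that it runs without Goutte. The helper name cellAfterHeader and the wrapping table element are assumptions made for illustration; only the tr markup comes from the question.

```php
<?php
// Sample markup based on the question; in real use this would be the fetched page body.
$html = <<<'HTML'
<table>
<tr>
<th>Website:</th>
<td>
    <a href="http://www.adres.com" target="_blank">http://www.adres.com</a>
</td>
</tr>
</table>
HTML;

/**
 * Hypothetical helper (not part of Goutte): returns the trimmed text of the
 * first <td> sibling that follows a <th> whose text equals $label, or null
 * when no such cell exists.
 */
function cellAfterHeader(string $html, string $label): ?string
{
    $doc = new DOMDocument();
    // Suppress the warnings libxml emits for imperfect real-world HTML.
    @$doc->loadHTML($html);

    $xpath = new DOMXPath($doc);

    // normalize-space() ignores surrounding whitespace in the <th> text.
    // Assumption: $label contains no double quote, which would break the expression.
    $query = sprintf('//th[normalize-space(.) = "%s"]/following-sibling::td[1]', $label);

    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    return trim($nodes->item(0)->textContent);
}

$website = cellAfterHeader($html, 'Website:');
```

With Goutte, the same expression could probably be passed to filterXPath() on the crawler instead of filter(), for example $crawler3->filterXPath('//th[normalize-space(.) = "Website:"]/following-sibling::td[1]'), checking count() before calling text().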

score 0 · Answer 2 · answered Mar 21 '20 at 15:26

Here is a solution to your question.

The table in the php_notes.php file:

<table id="table" border="1">
    <tr>
    <a href="">xyz</a>
    <a href="">abc</a>
    <h1>Heading</h1>
    <th>Website:</th>
    <td>
        <a href="http://www.adres.com" target="_blank">http://www.adres.com</a>
    </td>
    <th>Website:abc</th>
    <td>
        <a href="http://www.adres.com" target="_blank">http://www.ares.com</a>
    </td>
    </tr>
</table>

Crawler.php finds the text of the anchor tag in the php_notes.php file:

use Weidner\Goutte\GoutteFacade;
use Symfony\Component\DomCrawler\Crawler;

$crawler = GoutteFacade::request('GET', 'http://localhost/php_notes.php');

$table = $crawler->filter('#table'); // find the parent table

// Visit each <td> and inspect the element directly above it.
$table->filter('td')->each(function (Crawler $node) {
    // previousAll() returns the sibling elements that precede this <td>.
    $alike = $node->previousAll();

    // Skip cells with no preceding sibling; nodeName() would throw on an empty list.
    if ($alike->count() === 0) {
        return;
    }

    $elementTag = $alike->eq(0); // the element directly above this <td>

    // Accept the cell only when that element is a <th> labelled "Website:".
    if ($elementTag->nodeName() === 'th' && trim($elementTag->text()) === 'Website:') {
        $text = $node->filter('a')->text();

        dd('Text found from td "' . $text . '"');
    }
});

dd('No text was found in an a tag');

You can find help with the Symfony DomCrawler component here: https://symfony.com/doc/current/components/dom_crawler.html
