Web Scrape Symfony2 - Impossible Challenge - Crawler Parsing

Question

(Edit: I've still found no way of solving this problem. The $crawler object seems ridiculous to work with, I just want to parse it for a specific <td> text, how hard is that? I cannot serialize() the entire crawler object either and make the entire source code for the web page into a string, or else I could just parse that string the hard way. Please help. I feel I've described the problem well, below.)

Below I'm using Symfony, Goutte, and DomCrawler to scrape a web page. I've been trying to figure it out through other questions with no success, so now I'm just going to post all my code to make this as straightforward as possible.

I am able to get the page and get the first bit of data I'm looking for. The first is a URL that is printed from JavaScript and lies within an <a> tag's onclick attribute as part of a long string, so I use a preg_match to sift through it and get exactly what I need.

The next bit of data I need is some text within a <td> tag. The thing is, this web page has 10-20 different <table> tags, and there are no id="" or class="" attributes, so it's hard to isolate. So what I'm trying to do is search for the words "Event Title", then go to the next sibling <td> tag and extract the innerHtml of that, which will be the actual title.

The problem is that for the second part I can't seem to parse properly through the $crawler object. I don't understand, I did a preg_match before on a serialize() version of the $crawler object, but for the bottom half I can't seem to parse through properly.

$crawler = $client->request('GET', 'https://movies.randomjunk.com/events/EventServlet?ab=mov&eventId=154367');



$aurl = 'http://movies.randomjunk.com/r.htm?e=154367'; // event url beginning string
$gas = $overview->filter('a[onclick*="' . $aurl . '"]');

$string1 = serialize($gas->filter('a')->attr('onclick')); //TEST
$string1M = preg_match("/(?<=\')(.*?)(?=\')/", $string1, $finalURL); 
$aString = $finalURL[0];
echo "<br><br>" . $aString . "<br><br>";
// IT WORKS UP TO HERE


// $title = $crawler->filterXPath('//td[. = "Event Title"]/following-sibling::td[1]')->each(funtion (Crawler $crawler, $i) {
//     return $node->text();
// }); // No clue why, but this doesn't work. 

$html = $overview->getNode(0)->ownerDocument->saveHTML();


$re = "/>Event\sTitle.*?<\\/td>.*?<td>\\K.*?(?=<\\/td>)/s";
$str = serialize($html);
print_r($str);
preg_match_all($re, $str, $matches);
$gas2 = $matches[0];


echo "<pre>";
    print_r($gas2);
echo "</pre>";

My preg_match just returns an empty array. I think it's a problem with searching the $crawler object, since it's made up of many nodes. I've been trying to just convert it all to html then to a preg_match but it just refuses to work. I've done a few print_r statements, and it just returns the whole web page.

Here's an example of some of the HTML inside the crawler object:

{lots of other html and tables}
<table> 
    <tr>
        <td>Title</td>
        <td>The Harsh Face of Mother Nature</td>
        <td>The Harsh Face of Mother Nature</td>
    </tr>
    .
    .
</table>
{lots of other html and tables}

And the goal is to parse through the entire page/$crawler object and get the title "The Harsh Face of Mother Nature".

I know this must be possible, but the only answer anyone wants to provide is a link to the domcrawler page which I've read about a thousand times at this point. Please help.

What data exactly do you need to get from all this? The titles? — Nawfal Serrar, Mar 27 '15 at 05:48
At the bottom I've listed the goal: to get the title from the `<td>`, *The Harsh Face of Mother Nature*. This will be dynamic and always changing, but the previous `<td>` will stay the same, `Title`. So I must find that `<td>`, then go to its next sibling, and there will be my answer. — Kenny, Mar 27 '15 at 13:12

Shaun Bramley · Accepted Answer · 2015-04-01T20:44:05.523

2

Given the html fragment above I was able to come up with the XPath of:

//table/tr/td[.='Title']/following-sibling::td[1]

You can test the XPath against the HTML fragment you provided in an online XPath tester, or with DomCrawler directly:

$html = '<table><tr><td>Event Title</td><td>The Harsh Face of Mother Nature</td><td>The Harsh Face of Mother Nature</td></tr><tr><td>Event Title</td><td>The Harsh Face of Mother Nature</td><td>The Harsh Face of Mother Nature</td></tr><tr><td>Event Title</td><td>The Harsh Face of Mother Nature</td><td>The Harsh Face of Mother Nature</td></tr></table>';
$crawler = new Symfony\Component\DomCrawler\Crawler($html);

$query = "//table/tr/td[.='Event Title']/following-sibling::td[1]";
$crawler->filterXPath($query)->each(function ($node, $i) {
    echo $node->text() . PHP_EOL;
});

Which outputs:

The Harsh Face of Mother Nature
The Harsh Face of Mother Nature
The Harsh Face of Mother Nature

Update: Tested successfully with:

$html = '<html><table><tr><td>Event Title</td><td>The Harsh Face of Mother Nature</td><td>The Harsh Face of Mother Nature</td></tr></table><table><tr><td>Event Title</td><td>The Harsh Face of Mother Nature</td><td>The Harsh Face of Mother Nature</td></tr></table><table><tr><td>Event Title</td><td>The Harsh Face of Mother Nature</td><td>The Harsh Face of Mother Nature</td></tr></table></html>';

Update: After being provided with sample html from the website I was able to get things to parse with the following XPath:

//td[normalize-space(text()) = 'Event Title']/following-sibling::td[1]

The real issue was the leading and trailing white space that was around "Event Title".
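
To see why the whitespace matters, here is a minimal sketch that compares the exact-match XPath with the normalize-space() version. It assumes PHP's built-in DOMDocument and DOMXPath instead of DomCrawler so that it runs without Composer, and the markup is invented sample HTML, not the real page.

```php
<?php
// Invented sample markup: the label cell has leading and trailing whitespace,
// which is what the answer above identified as the real problem.
$html = <<<HTML
<html><body>
<table><tr><td>Other</td><td>ignored</td></tr></table>
<table>
  <tr>
    <td>
      Event Title
    </td>
    <td>The Harsh Face of Mother Nature</td>
    <td>The Harsh Face of Mother Nature</td>
  </tr>
</table>
</body></html>
HTML;

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Exact comparison: the cell's string value includes the surrounding whitespace.
$exact = $xpath->query("//td[. = 'Event Title']/following-sibling::td[1]");

// normalize-space() trims and collapses whitespace before comparing.
$normalized = $xpath->query(
    "//td[normalize-space(text()) = 'Event Title']/following-sibling::td[1]"
);

$exactCount = $exact->length;
$titles = [];
foreach ($normalized as $td) {
    $titles[] = trim($td->textContent);
}

echo "Exact matches: {$exactCount}" . PHP_EOL;
echo "Normalized matches: " . implode(', ', $titles) . PHP_EOL;
```

The same expression string can be passed to filterXPath(), as in the answer above; only the DOM plumbing differs in this sketch.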

edited Apr 01 '15 at 20:44

answered Mar 31 '15 at 23:55


Right now I'm getting `The current node list is empty.` so I've got to figure out why. – Kenny Apr 01 '15 at 00:59
Are you able to provide a copy / paste of the actual page source? Also if the nodelist is empty then the issue will probably be with the .='Title' portion of the XPath expression. – Shaun Bramley Apr 01 '15 at 01:50
Your new code removed any errors, but now I get nothing. I put text into the echo and it seems the `each()` function isn't even being accessed. My text is "Event Title". I don't need to put a %20 or \s in for spaces in Xpath queries, do I? – Kenny Apr 01 '15 at 01:51
I mean, I put "Event Title" in, but the `each()` function still isn't being accessed. I think it's something with the `$crawler` object. See, I scraped a huge web page full of a LOT of crap. I mentioned in my post that this clump of HTML has over 10 tables in it, so not only do I have to access the right table, but the right td within. I'm not sure how nodes work in Symfony, since the documentation doesn't go real deep, but I think that it just accesses the first `<table>` then doesn't go past. So perhaps if I had an `each()` function looping through all the tables then a check on each td.
– Kenny Apr 01 '15 at 02:18
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/74222/discussion-between-shaun-bramley-and-kenny). – Shaun Bramley Apr 01 '15 at 02:33

score 0 · Answer 2 · answered Mar 27 '15 at 14:20


Alright, what you can do is use a class on your <td>:

<td class="mytitle">The Harsh Face of Mother Nature</td>

You can then use it to filter your crawler and get all your titles in an array, like this:

$titles = $crawler->filter('td.mytitle')->extract(array('_text'));

Here td.mytitle is a CSS selector that selects every td with the mytitle class, and extract(array('_text')) returns the text inside each matched node.

Easy and more performant than regex...

I haven't tested this code, but it should work. You can find more help and information about the crawler here:

http://symfony.com/fr/doc/current/components/dom_crawler.html


Nawfal Serrar


I'm scraping a web page that I have no control over, so there is no way to add a class, or else this would be super easy. – Kenny Mar 27 '15 at 14:23
have you tried $crawler->filter('html:contains("Title")')->each(function ($node) { $node->siblings()->first()->extract(array('_text')); }); – Nawfal Serrar Mar 27 '15 at 14:32
Hey, thanks for your response. `$crawler->filter('html:contains("Title")');` I tried this line of code first, and it returns the html from the entire page. When I `print_r` I just get the entire page. I also tried the code you gave me and it returns an empty array. – Kenny Mar 27 '15 at 15:31

Huzaifa · Answer 3 · 2020-03-21T16:04:48.867

Here is another answer for this question.

use Weidner\Goutte\GoutteFacade;
use Symfony\Component\DomCrawler\Crawler;


$crawler = GoutteFacade::request('GET','http://localhost/php_notes.php');

// find the parent table 
$table = $crawler->filter('table')->each(function($table){

    $tdText = $table->filter('td')->each(function ($node){


        $alike = $node->previousAll(); // all preceding siblings of this <td>, nearest first (returned as a Crawler, not an array)

        $elementTag = $alike->eq(0); // the nearest preceding sibling; an empty Crawler if this is the first cell

        if($elementTag->nodeName()=='td'){

            if($elementTag->text()=='Title')
            {
                dump("Title Heading => ".$elementTag->text()); // Title
                dd("Title Value => ".$node->text()); // The Harsh Face of Mother Nature
            }
        }


    });
});

For this to work you will need to change nodeName() in the Symfony\dom-crawler\Crawler.php file (around line 567) so that it returns null instead of throwing when the node list is empty:

public function nodeName()
    {
        if (!$this->nodes) {
            return null;
            // throw new \InvalidArgumentException('The current node list is empty.');
        }

        return $this->getNode(0)->nodeName;
    }
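
For comparison, the same "look at the cell just before this one" check can be sketched without patching any library file. This version assumes PHP's built-in DOMDocument and DOMXPath rather than DomCrawler, and the markup is invented sample HTML. With this API a missing previous sibling is an empty result list, so there is no exception to work around.

```php
<?php
// Invented sample markup; the label cell carries stray whitespace on purpose.
$html = '<table><tr><td> Title </td>'
      . '<td>The Harsh Face of Mother Nature</td>'
      . '<td>The Harsh Face of Mother Nature</td></tr></table>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$found = [];
foreach ($xpath->query('//table//td') as $td) {
    // Nearest preceding element sibling of this cell, if any.
    $prev = $xpath->query('preceding-sibling::*[1]', $td);
    if ($prev->length === 0) {
        continue; // first cell in the row: nothing to compare against
    }

    $label = $prev->item(0);
    if ($label->nodeName === 'td' && trim($label->textContent) === 'Title') {
        $found[] = trim($td->textContent);
    }
}

print_r($found);
```

The third cell is skipped because its nearest preceding sibling is the value cell, not the label, so only the cell directly after "Title" is collected.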
