Extracting site data through a web crawler outputs an error due to an array index mismatch

Question

I have been trying to extract the text of a table, along with its links, from a page on site1.com into my PHP page using a web crawler.

Unfortunately, the PHP code outputs an error, which I believe is caused by an incorrect array index.

site1.com

<table border="0" cellpadding="0" cellspacing="0" width="100%" class="Table2">
<tbody><tr>
    <td width="1%" valign="top" class="Title2">&nbsp;</td>
    <td width="65%" valign="top" class="Title2">Subject</td>
    <td width="1%" valign="top" class="Title2">&nbsp;</td>
    <td width="14%" valign="top" align="Center" class="Title2">Last Update</td>
    <td width="1%" valign="top" class="Title2">&nbsp;</td>
    <td width="8%" valign="top" align="Center" class="Title2">Replies</td>
    <td width="1%" valign="top" class="Title2">&nbsp;</td>
    <td width="9%" valign="top" align="Center" class="Title2">Views</td>
</tr>
<tr>
    <td width="1%" height="25">&nbsp;</td>
    <td width="64%" height="25" class="FootNotes2"><a href="/files/forum/2017/1/837110.php" target="_top" class="Links2">Serious dedicated study partner for U World</a> - step12013</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="14%" height="25" class="FootNotes2" align="center">02/11/17 01:50</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="8%" height="25" align="Center" class="FootNotes2">10</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="9%" height="25" align="Center" class="FootNotes2">318</td>
</tr>
</tbody>
</table>

The PHP web crawler is as follows:

<?php
    function get_data($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_URL,$url);
    $result=curl_exec($ch);
    curl_close($ch);
    return $result;
    }
    $returned_content = get_data('http://www.usmleforum.com/forum/index.php?forum=1');
    $first_step = explode( '<table class="Table2">' , $returned_content );
    $second_step = explode('</table>', $first_step[0]);
    $third_step = explode('<tr>', $second_step[1]);
    // print_r($third_step);
    foreach ($third_step as $key=>$element) {
    $child_first = explode( '<td class="FootNotes2"' , $element );
    $child_second = explode( '</td>' , $child_first[1] );
    $child_third = explode( '<a href=' , $child_second[0] );
    $child_fourth = explode( '</a>' , $child_third[0] );
    $final = "<a href=".$child_fourth[0]."</a></br>";
?>

<li target="_blank" class="itemtitle">
    <?php echo $final?>
</li>

<?php
    if($key==10){
       break;
        }
    }
?>

I suspect the array index in the PHP code above is the culprit. If so, can someone please explain how to make this work?

My final requirement is to get the text from the second column above, together with the link associated with it.

Any help is appreciated.

Can you describe what you are trying to achieve? Maybe we can help you write better code, as the PHP code above is neither clean nor flexible! — webNeat, Feb 09 '17 at 13:52
I'm just trying to get a web crawler that can go to the link (mentioned above) and bring the links, along with their associated text, into my page (the page where the PHP script lives) — harishk, Feb 09 '17 at 13:55
I already wrote code like that, which does exactly the same job for another site. Since the array index arrangement is different for different sites, the index numbers won't work for every site. Now I'm stuck getting the index for this site... — harishk, Feb 09 '17 at 13:56
@gabe3886 I'm getting something like `Undefined offset: 1`, which I'm pretty sure is due to an array index mismatch. What do you think? — harishk, Feb 14 '17 at 15:55
What happens if you `var_dump($child_first)` in the foreach loop? That will tell you what you're getting in the `$child_first` variable, and what index options are available. If you're only getting 1 instance of ` — gabe3886, Feb 14 '17 at 15:58
@gabe3886 And I did make the same code work for another site. I have posted the code for it as the 2nd answer; please take a look at it. — harishk, Feb 14 '17 at 15:59
@gabe3886 It gave an offset 1 error at `$child_second = explode( '</td>' , $child_first[1] );` — harishk, Feb 14 '17 at 16:01
The issue is that you're searching specifically on `<td class="FootNotes2"`. As such, it's not actually managing to match. The reason your other version works is that there's no height or width setting in that part. — gabe3886, Feb 14 '17 at 16:04
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/135704/discussion-between-harishk-and-gabe3886). — harishk, Feb 14 '17 at 16:07
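
To illustrate the point made in the comments above, here is a minimal sketch of why the index lookup fails. It assumes a single cell copied from the table in the question; the variable names `$row` and `$parts` are only for illustration. When `explode()` does not find its delimiter, it returns a one-element array holding the whole input, so index 1 does not exist.

```php
<?php
// One cell copied from the table in the question (assumed to be representative).
$row = '<td width="64%" height="25" class="FootNotes2">'
     . '<a href="/files/forum/2017/1/837110.php" target="_top" class="Links2">'
     . 'Serious dedicated study partner for U World</a> - step12013</td>';

// The delimiter from the question expects "class" to follow "<td" directly,
// but here the width and height attributes sit in between.
$parts = explode('<td class="FootNotes2"', $row);
var_dump(count($parts));    // int(1): delimiter not found, only index 0 exists
var_dump(isset($parts[1])); // bool(false): reading $parts[1] triggers the notice

// Splitting on the attribute alone matches regardless of what precedes it.
$parts = explode('class="FootNotes2"', $row);
var_dump(count($parts));    // int(2)
```

On PHP 7 the failed lookup is reported as "Undefined offset: 1"; on PHP 8 it is worded "Undefined array key 1". Matching on a shorter substring gets past this particular error, but it stays fragile whenever the markup changes, which is why the answers below suggest a DOM parser instead.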

jkrnak · Answer 1 · 2017-02-13T15:03:21.513

Instead of writing your own parser solution you could use an existing one like Symfony's DomCrawler component: http://symfony.com/doc/current/components/dom_crawler.html

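use Symfony\Component\DomCrawler\Crawler; // requires the symfony/dom-crawler package

$crawler = new Crawler($returned_content);
// collect the text of every <a> element
$linkTexts = $crawler->filterXPath('//a')->each(function (Crawler $node, $i) {
    return $node->text();
});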

Or if you want to traverse the DOM tree yourself you can use DOMDocument's loadHTML http://php.net/manual/en/domdocument.loadhtml.php

$document = new DOMDocument();
$document->loadHTML($returned_content);
foreach ($document->getElementsByTagName('a') as $link) {
    $text = $link->nodeValue;
}

EDIT:

The code below gets the links you want. It assumes you have a $returned_content variable holding the HTML you want to parse.

// creating a new instance of DOMDocument (DOM = Document Object Model)
$domDocument = new DOMDocument();
// save previous libxml error reporting and set error reporting to internal
// to be able to parse not well formed HTML doc
$previousErrorReporting = libxml_use_internal_errors(true);
$domDocument->loadHTML($returned_content);
libxml_use_internal_errors($previousErrorReporting);
$links = [];
/** @var DOMElement $node */
// getting all <a> element from the HTML
foreach ($domDocument->getElementsByTagName('a') as $node) {
    $parentNode = $node->parentNode;
    // checking if the <a> is under a <td> that has class="FootNotes2"
    $isChildOfAFootNotesTd = $parentNode->nodeName === 'td' && $parentNode->getAttribute('class') === 'FootNotes2';
    // checking if the <a> has class="Links2"
    $isLinkOfLink2Class = $node->getAttribute('class') == 'Links2';
    // as I assumed you wanted links from the <td> this check makes sure that both of the above conditions are fulfilled
    if ($isChildOfAFootNotesTd && $isLinkOfLink2Class) {
        $links[] = [
            'href' => $node->getAttribute('href'),
            'text' => $parentNode->textContent,
        ];
    }
}

print_r($links);

This produces an array similar to:

Array
(
    [0] => Array
    (
        [href] => /files/forum/2017/1/837242.php
        [text] => Q@Q Drill Time ① - cardio69
    ) 
    [1] => Array
    (
        [href] => /files/forum/2017/1/837356.php
        [text] => study partner in Houston - lacy
    )
    [2] => Array
    (
        [href] => /files/forum/2017/1/837110.php
        [text] => Serious dedicated study partner for U World - step12013
    )
    ...
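
The href values in this array are relative to the site root, so they will not work as-is on another page. A possible follow-up step, sketched here under the assumption that the base URL is http://www.usmleforum.com (the host used in the question), is to prefix them. The helper name `to_absolute_links` is hypothetical and not part of DOMDocument.

```php
<?php
// Hypothetical helper: prefixes root-relative hrefs with a base URL.
function to_absolute_links(array $links, string $baseUrl): array
{
    $base = rtrim($baseUrl, '/');
    return array_map(function (array $link) use ($base) {
        $href = $link['href'];
        // Only touch hrefs like "/files/..."; leave "//host/..." and full URLs alone.
        if (strpos($href, '/') === 0 && strpos($href, '//') !== 0) {
            $link['href'] = $base . $href;
        }
        // Strip whitespace that textContent may carry over from the markup.
        $link['text'] = trim($link['text']);
        return $link;
    }, $links);
}

$links = [
    ['href' => '/files/forum/2017/1/837110.php',
     'text' => 'Serious dedicated study partner for U World - step12013'],
];
print_r(to_absolute_links($links, 'http://www.usmleforum.com'));
```

This only handles root-relative paths; hrefs relative to the current directory would need proper URL resolution.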

Bro, thanks for your time, but I would like to do it my way. Although it's for learning purposes, I'm kind of keen on this approach. If possible, can you please let me know how to identify the array index of an HTML element for this web crawler? Also, is it the array index that is messing with the code, or something else? If so, please let me know. Thanks. — harishk, Feb 13 '17 at 05:23
I tried the same code with another site and it worked just fine. I feel the code is OK for what I need. Help me with the array index. Thanks. — harishk, Feb 13 '17 at 10:27
I know very little about PHP, but after looking at your code I feel I know nothing. You may have given me the right answer for my problem, but since I don't think I understand it, I can't consider it. Please check the answer I just posted. — harishk, Feb 13 '17 at 12:07
It's not about the indexes of your array. I cannot encourage you to go down that route, it's not a good way to parse. I added more comments to the code. For now you can ignore the lines with `libxml_use_internal_errors`, just focus on the variable names and the comments, it should give you enough clues what is happening. Also please read the documentation for DOMDocument http://php.net/manual/en/class.domdocument.php. Believe me, once you understand what is this DOM and how you can traverse the tree it will make you a better web developer. — jkrnak, Feb 13 '17 at 15:07
Where did you mention the website link in your code to parse the links? — harishk, Feb 13 '17 at 15:19
Dude, I know that you are trying to guide me toward a safe and proper way to code. But right now I'm begging you (figuratively): just tell me how to make the code that I posted work. — harishk, Feb 13 '17 at 16:13

score 3 · Answer 2 · edited Mar 07 '18 at 04:27

I tried the same code on another site, and it works. Please take a look at it:

<?php
    function get_data($url) {
      $ch = curl_init();
      curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
      curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
      curl_setopt($ch, CURLOPT_URL,$url);
      $result=curl_exec($ch);
      curl_close($ch);
      return $result;
    }
    $returned_content = get_data('http://www.usmle-forums.com/usmle-step-1-forum/');
    $first_step = explode( '<tbody id="threadbits_forum_26">' , $returned_content );
    $second_step = explode('</tbody>', $first_step[1]);
    $third_step = explode('<tr>', $second_step[0]);
    // print_r($third_step);
    foreach ($third_step as $element) {
      $child_first = explode( '<td class="alt1"' , $element );
      $child_second = explode( '</td>' , $child_first[1] );
      $child_third = explode( '<a href=' , $child_second[0] );
      $child_fourth = explode( '</a>' , $child_third[1] );
      echo $final = "<a href=".$child_fourth[0]."</a></br>";
    }
    ?>

I know it's too much to ask, but can you please combine these two into code that makes the crawler work?

@jkrnak

This should form part of the question as a working example of something else; it's not a solution to the question you are asking here — gabe3886, Feb 14 '17 at 16:05
Yes, of course it's not a solution, but I just randomly posted it here. I will delete it once I get the solution. — harishk, Feb 14 '17 at 16:06
It looks like you can delete this now that you have a solution. — mickmackusa, Mar 26 '18 at 23:53

MrDarkLynx · Accepted Answer · 2017-02-20T13:58:03.380

Using the Simple HTML DOM Parser library, you can use the following code:

<?php
    require('simple_html_dom.php'); // you might need to change this, depending on where you saved the library file.

    $html = file_get_html('http://www.usmleforum.com/forum/index.php?forum=1');

    foreach($html->find('td.FootNotes2 a') as $element) { // find all <a>-elements inside a <td class="FootNotes2">-element
        $element->href = "http://www.usmleforum.com" . $element->href;  // you can also access only certain attributes of the elements (e.g. the url).
        echo $element.'<br>';  // do something with the elements.
    }
?>

score 0 · Answer 4 · answered Mar 07 '18 at 00:41

Chopping at HTML with string functions or regex is not a reliable method. DOMDocument and XPath do a nice job.

Code:

$dom=new DOMDocument; 
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach ($xpath->evaluate("//td[@class = 'FootNotes2']/a") as $node) {  // target a tags that have <td class="FootNotes2"> as parent
    $result[]=['href' => $node->getAttribute('href'), 'text' => $node->nodeValue];  // extract/store the href and text values
    if (sizeof($result) == 10) { break; }  // set a limit of 10 rows of data
}
if (isset($result)) {
    echo "<ul>\n";
    foreach ($result as $data) {
        echo "\t<li class=\"itemtitle\"><a href=\"{$data['href']}\" target=\"_blank\">{$data['text']}</a></li>\n";
    }
    echo "</ul>";
}

Sample Input:

$html = <<<HTML
<table border="0" cellpadding="0" cellspacing="0" width="100%" class="Table2">
<tbody><tr>
    <td width="1%" valign="top" class="Title2">&nbsp;</td>
    <td width="65%" valign="top" class="Title2">Subject</td>
    <td width="1%" valign="top" class="Title2">&nbsp;</td>
    <td width="14%" valign="top" align="Center" class="Title2">Last Update</td>
    <td width="1%" valign="top" class="Title2">&nbsp;</td>
    <td width="8%" valign="top" align="Center" class="Title2">Replies</td>
    <td width="1%" valign="top" class="Title2">&nbsp;</td>
    <td width="9%" valign="top" align="Center" class="Title2">Views</td>
</tr>
<tr>
    <td width="1%" height="25">&nbsp;</td>
    <td width="64%" height="25" class="FootNotes2"><a href="/files/forum/2017/1/837110.php" target="_top" class="Links2">Serious dedicated study partner for U World</a> - step12013</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="14%" height="25" class="FootNotes2" align="center">02/11/17 01:50</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="8%" height="25" align="Center" class="FootNotes2">10</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="9%" height="25" align="Center" class="FootNotes2">318</td>
</tr>
<tr>
    <td width="1%" height="25">&nbsp;</td>
    <td width="64%" height="25" class="FootNotes2"><a href="/files/forum/2017/1/837999.php" target="_top" class="Links2">some text</a> - step12013</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="14%" height="25" class="FootNotes2" align="center">02/11/17 01:50</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="8%" height="25" align="Center" class="FootNotes2">10</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="9%" height="25" align="Center" class="FootNotes2">318</td>
</tr>
</tbody>
</table>
HTML;

Output:

<ul>
    <li class="itemtitle"><a href="/files/forum/2017/1/837110.php" target="_blank">Serious dedicated study partner for U World</a></li>
    <li class="itemtitle"><a href="/files/forum/2017/1/837999.php" target="_blank">some text</a></li>
</ul>
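
One caveat with the rendering loop above: it writes the scraped href and text straight into the page. As a hedged sketch, the same list could be built with `htmlspecialchars()` so that characters such as `&` or `<` in a thread title do not break the markup. The function name `render_links` is hypothetical; the row shape matches the `$result` array above.

```php
<?php
// Hypothetical helper: renders rows shaped like ['href' => ..., 'text' => ...].
function render_links(array $rows): string
{
    $out = "<ul>\n";
    foreach ($rows as $row) {
        // Escape both values before placing them into attribute and text positions.
        $href = htmlspecialchars($row['href'], ENT_QUOTES, 'UTF-8');
        $text = htmlspecialchars($row['text'], ENT_QUOTES, 'UTF-8');
        $out .= "\t<li class=\"itemtitle\"><a href=\"{$href}\" target=\"_blank\">{$text}</a></li>\n";
    }
    return $out . "</ul>";
}

echo render_links([
    ['href' => '/files/forum/2017/1/837999.php', 'text' => 'Q&A <drill>'],
]);
```

Here the title is emitted as `Q&amp;A &lt;drill&gt;`, which the browser displays as the original text.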
