Skip the first two statements of a site when extracted by a PHP web crawler

Question

I have a PHP web crawler that works fine (for now).

It extracts forum questions and their links from a site and pastes them into my site.

Now I want it to do the same, except that it should skip the first two rows of the extracted site. Instead of outputting every statement, it should start from statement 3.

My code is:

<?php
    function get_data($url) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_URL,$url);
        $result=curl_exec($ch);
        curl_close($ch);
        return $result;
    }
    $returned_content = get_data('http://www.usmle-forums.com/usmle-step-1-forum/');
    $first_step = explode( '<tbody id="threadbits_forum_26"' , $returned_content );
    $second_step = explode('</tbody>', $first_step[1]);
    $third_step = explode('<tr>', $second_step[0]);
    // print_r($third_step);
    foreach ($third_step as $key=>$element) {
        $child_first = explode( '<td class="alt1"' , $element );
        $child_second = explode( '</td>' , $child_first[1] );
        $child_third = explode( '<a href=' , $child_second[0] );
        $child_fourth = explode( '</a>' , $child_third[1] );
        $final = "<a href=".$child_fourth[0]."</a></br>";
        echo '<li target="_blank" class="itemtitle">';
        if($key < 5 && $key > 2 && rand(0,1) == 1) {
            echo '<span class="item_new">new</span>';
        }
        echo $final;
        echo '</li>';
        if($key==10) {
            break;
        }
    }
?>

Any help is appreciated.

MrDarkLynx · Accepted Answer · 2017-02-20T12:58:00.970

2

You could introduce a counter variable $i and increment it on every foreach iteration, then only execute the rest of the loop body once it has been increased twice:

<?php
    function get_data($url) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_URL,$url);
        $result=curl_exec($ch);
        curl_close($ch);
        return $result;
    }
    $returned_content = get_data('http://www.usmle-forums.com/usmle-step-1-forum/');
    $first_step = explode( '<tbody id="threadbits_forum_26"' , $returned_content );
    $second_step = explode('</tbody>', $first_step[1]);
    $third_step = explode('<tr>', $second_step[0]);
    // print_r($third_step);
    $i = 1;
    foreach ($third_step as $key=>$element) {
        if ($i < 3) {
            $i++;
            continue;
        }
        $child_first = explode( '<td class="alt1"' , $element );
        $child_second = explode( '</td>' , $child_first[1] );
        $child_third = explode( '<a href=' , $child_second[0] );
        $child_fourth = explode( '</a>' , $child_third[1] );
        $final = "<a href=".$child_fourth[0]."</a></br>";
        echo '<li target="_blank" class="itemtitle">';
        if($key < 5 && $key > 2 && rand(0,1) == 1) {
            echo '<span class="item_new">new</span>';
        }
        echo $final;
        echo '</li>';
        if($key==10) {
            break;
        }
    }
?>
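
If the rows are already in an array, the skip can also be done before the loop with array_slice(), which avoids the extra counter. This is only a minimal sketch: $rows is a hypothetical stand-in for $third_step, and its values are made up for illustration.

```php
<?php
// Hypothetical stand-in for $third_step; the values are illustrative only.
$rows = ['row1', 'row2', 'row3', 'row4', 'row5'];

// Drop the first two elements and keep at most 10 of the rest.
// array_slice() re-indexes the result by default, so keys start at 0 again.
$kept = array_slice($rows, 2, 10);

foreach ($kept as $key => $element) {
    echo $key . ': ' . $element . "\n";
}
```

Because the keys are re-indexed, any condition that depends on $key (such as the `$key < 5 && $key > 2` check) would need adjusting; passing true as the fourth argument of array_slice() preserves the original keys instead.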

edited Feb 20 '17 at 12:58

answered Feb 20 '17 at 12:50


it says `undefined constant i - assumed 'i'` – harishk Feb 20 '17 at 12:57
Oh thanks man that solved it.. and if you don't mind and if you got a free minute.. please check this other question too.. http://stackoverflow.com/questions/42137646/extracting-site-data-through-web-crawler-outputs-error-due-to-mis-match-of-array – harishk Feb 20 '17 at 13:02
@harishk Are you not satisfied with the solution you got there? I'm sorry, I don't really understand what you want me to do.. – MrDarkLynx Feb 20 '17 at 13:08
Yes man,,, that was great.. it solved my problem.... i m asking another favor from you to solve my another question.... link given above.. please take a look at it.. i even tried bounty on it.. but none can able to solve it//.. – harishk Feb 20 '17 at 13:11
@harishk You got me wrong, I mean the question you linked. You've already marked an answer as accepted there? – MrDarkLynx Feb 20 '17 at 13:13
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/136155/discussion-between-harishk-and-mrdarklynx). – harishk Feb 20 '17 at 13:16

mickmackusa · Answer 2 · 2018-03-07T01:09:28.237

I am not quite sure of the logic behind your <span>new</span> randomizer, but I can assure you that chopping at HTML data with string functions is not trustworthy (when it fails, it will fail silently). Instead, I recommend DOMDocument with XPath for your task.
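
As an aside on why the string-function approach is fragile: if the site changes its markup so that the marker string is no longer present, explode() returns a one-element array, and the [1] lookup no longer points at anything. The sketch below uses made-up markup ($changed is an assumption for illustration, not taken from the forum).

```php
<?php
// Made-up markup in which the class attribute was renamed from "alt1" to "alt2".
$changed = '<tr><td class="alt2"><a href="http://www.example.com">test</a></td></tr>';

$parts = explode('<td class="alt1"', $changed);

// The delimiter is absent, so explode() returns the whole string as a single element.
var_dump(count($parts));     // int(1)
var_dump(isset($parts[1]));  // bool(false)
```

Reading $parts[1] at that point yields null (with a notice or warning, depending on the PHP version and error settings), and the loop carries on printing empty links rather than stopping.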

Code: (Demo)

$dom = new DOMDocument;
libxml_use_internal_errors(true);  // scraped HTML is rarely valid; collect parser warnings instead of printing them
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$result = '';
foreach ($xpath->evaluate("//td[@class='alt1']/a") as $i => $node) {  // target a tags that have <td class="alt1"> as parent
    if ($i > 1) {  // disqualify first two nodes
        $result .= "<li class=\"itemtitle\"><a href=\"{$node->getAttribute('href')}\" target=\"_blank\">{$node->nodeValue}</a></li>";
        if ($i == 11) { break; }  // set a limit of 10 rows of data (#3 to #12)
    }
}
if ($result) {
    echo "<ul>$result</ul>";
}

Sample Input: (since I didn't want to scrape the posted url)

$html = <<<HTML
<table>
    <tbody id="threadbits_forum_26">
        <tr>
            <td class="alt1">
                <a href="http://www.example1.com">test1</a>
            </td>
        </tr>
        <tr>
            <td class="alt1">
                <a href="http://www.example2.com">test2</a>
            </td>
        </tr>
        <tr>
            <td class="alt1">
                <a href="http://www.example3.com">test3</a>
            </td>
        </tr>
        <tr>
            <td class="alt1">
                <a href="http://www.example4.com">test4</a>
            </td>
        </tr>
        <tr>
            <td class="alt1">
                <a href="http://www.example5.com">test5</a>
            </td>
        </tr>
        <tr>
            <td class="alt1">
                <a href="http://www.example6.com">test6</a>
            </td>
        </tr>
    </tbody>
</table>
HTML;

Output:

<ul>
    <li class="itemtitle"><a href="http://www.example3.com" target="_blank">test3</a></li>
    <li class="itemtitle"><a href="http://www.example4.com" target="_blank">test4</a></li>
    <li class="itemtitle"><a href="http://www.example5.com" target="_blank">test5</a></li>
    <li class="itemtitle"><a href="http://www.example6.com" target="_blank">test6</a></li>
</ul>
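
The skip and the row limit can also be pushed into the XPath expression itself, and the query can be scoped to the tbody id used in the question. This is a sketch under assumptions: the markup is made up, and it assumes the wanted links always sit inside `<tbody id="threadbits_forum_26">`.

```php
<?php
// Made-up sample markup; only the tbody id is taken from the question.
$html = <<<HTML
<table>
    <tbody id="threadbits_forum_26">
        <tr><td class="alt1"><a href="http://www.example1.com">test1</a></td></tr>
        <tr><td class="alt1"><a href="http://www.example2.com">test2</a></td></tr>
        <tr><td class="alt1"><a href="http://www.example3.com">test3</a></td></tr>
        <tr><td class="alt1"><a href="http://www.example4.com">test4</a></td></tr>
    </tbody>
</table>
HTML;

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// The outer parentheses make position() count over the whole result set,
// not per parent <td>. Nodes 3 to 12 are kept, i.e. at most 10 links.
$query = "(//tbody[@id='threadbits_forum_26']//td[@class='alt1']/a)"
       . "[position() > 2 and position() <= 12]";

$links = [];
foreach ($xpath->query($query) as $node) {
    $links[] = $node->getAttribute('href') . ' ' . $node->nodeValue;
}
print_r($links);
```

Without the outer parentheses, `a[position() > 2]` would be evaluated relative to each `<td>`, and since every cell here holds a single link, nothing would match.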

@harishk I finally got around to providing you a reliable solution (versus the hacky regex solution I first posted). This is a far superior/trustworthy method for you to use. If you can explain the logic behind the `rand()` part, I can adjust my answer. If you have questions just ask. — mickmackusa, Mar 07 '18 at 01:11
