
I am getting a PHP notice when using Simple HTML DOM to scrape a website. Two notices are displayed, and everything rendered underneath them looks perfect when I use print_r to display it.

The website table structure is as follows:

    <table class="data schedTbl">
        <thead>
            <tr>
                <th>DATA</th>
                <th>DATA</th>
                <th>DATA</th>
                etc....
            </tr>
        </thead>
        <tbody>
            <tr>
                <td>
                    <div class="class1">DATA</div>
                    <div class="class2">SAME DATA AS PREVIOUS DIV</div>
                </td>
                <td>DATA</td>
                <td>DATA</td>
                etc....
            </tr>
            <tr>
                <td>
                    <div class="class1">DATA</div>
                    <div class="class2">SAME DATA AS PREVIOUS DIV</div>
                </td>
                <td>DATA</td>
                <td>DATA</td>
                etc....
            </tr>
            <tr>
                <td>
                    <div class="class1">DATA</div>
                    <div class="class2">SAME DATA AS PREVIOUS DIV</div>
                </td>
                <td>DATA</td>
                <td>DATA</td>
                etc....
            </tr>
            etc....
        </tbody>
    </table>

The code below is used to find every tr in table[class=data schedTbl]. I have a tbody selector in there, but it seems to be ignored, because the tr in the thead is still selected.

    include('simple_html_dom.php');

    $articles = array();

    getArticles('www.somesite.com');

    function getArticles($page) {
         global $articles;

         $html = new simple_html_dom();
         $html->load_file($page);

         $items = $html->find('table[class=data schedTbl] tbody tr');  

         foreach($items as $post) {

             $articles[] = array($post->children(0)->first_child(0)->plaintext,//0 -- GAME DATE
                        $post->children(1)->plaintext,//1 -- AWAY TEAM
                        $post->children(2)->plaintext);//2 -- HOME TEAM

         }

    }

So, I believe the notices come from the tr in the thead: I am calling first_child on the first cell of each row, and in the header row that cell has no child elements. There are two notices because there are actually two tables with the same structure on the page.

I believe there are 2 ways of solving this:

1) PROBABLY THE EASIEST: fix the find selector so the tbody part works and only the trs within the tbodies are selected.

2) Figure out a way to skip the first_child call for rows where it does not apply.
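A minimal sketch of option 2, written against PHP's built-in DOMDocument rather than Simple HTML DOM so that it runs on its own. The sample markup, its values, and the variable names are made up for illustration. The idea is simply to skip any row that does not have the expected td cells, which leaves out the header row. With Simple HTML DOM, the analogous guard would presumably be a null check on the result of $post->find('td', 0) before reading its children.

```php
<?php
// Sample markup mirroring the structure in the question (values are invented).
$markup = <<<HTML
<table class="data schedTbl">
  <thead><tr><th>DATE</th><th>AWAY</th><th>HOME</th></tr></thead>
  <tbody>
    <tr><td><div class="class1">Jan 25</div><div class="class2">Jan 25</div></td><td>Team A</td><td>Team B</td></tr>
    <tr><td><div class="class1">Jan 26</div><div class="class2">Jan 26</div></td><td>Team C</td><td>Team D</td></tr>
  </tbody>
</table>
HTML;

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($markup);

$articles = array();
foreach ($doc->getElementsByTagName('tr') as $row) {
    $cells = $row->getElementsByTagName('td');
    if ($cells->length < 3) {
        continue; // header rows hold th cells, so there is nothing to read here
    }
    // Use the first div of the first cell if present, else the cell text itself.
    $firstDiv = $cells->item(0)->getElementsByTagName('div')->item(0);
    $dateNode = ($firstDiv !== null) ? $firstDiv : $cells->item(0);
    $articles[] = array(
        trim($dateNode->textContent),          // 0 -- GAME DATE
        trim($cells->item(1)->textContent),    // 1 -- AWAY TEAM
        trim($cells->item(2)->textContent),    // 2 -- HOME TEAM
    );
}

print_r($articles);
```

The guard costs one check per row and does not depend on how the selector treats tbody.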

Please let me know if you would like a snapshot of the print_r($articles) output I am receiving.

Thanks in advance for any help provided!

Sincerely,

Bill C.

Bill Chambers
  • Hi all, don't waste your time trying to answer; I got it with the help of another question on this website. It seems tbody is skipped in the simple_html_dom.php file. To fix this, just comment out line #695, if ($m[1]==='tbody') continue;, and now it reads tbody. – Bill Chambers Jan 25 '13 at 20:58

1 Answer


Just comment out line #695 in simple_html_dom.php (the line number may differ in other versions of the library):

    if ($m[1]==='tbody') continue;

After that, the selector should read the tbody.
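If editing the library file is not an option, a different technique is to select the body rows with PHP's built-in DOMDocument and DOMXPath. This is only a sketch and not what the original code uses. The sample markup, its values, and the names $markup and $games are assumptions for illustration, and it relies on the page containing explicit tbody tags as shown in the question.

```php
<?php
// Two tables with the same structure, as described in the question (values invented).
$markup = <<<HTML
<table class="data schedTbl">
  <thead><tr><th>DATE</th><th>AWAY</th><th>HOME</th></tr></thead>
  <tbody>
    <tr><td><div class="class1">Jan 25</div><div class="class2">Jan 25</div></td><td>Team A</td><td>Team B</td></tr>
  </tbody>
</table>
<table class="data schedTbl">
  <thead><tr><th>DATE</th><th>AWAY</th><th>HOME</th></tr></thead>
  <tbody>
    <tr><td><div class="class1">Jan 26</div><div class="class2">Jan 26</div></td><td>Team C</td><td>Team D</td></tr>
  </tbody>
</table>
HTML;

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($markup);
$xpath = new DOMXPath($doc);

// Match tables whose class list contains the token "schedTbl",
// then take only the tr elements that sit directly inside a tbody.
$rows = $xpath->query(
    '//table[contains(concat(" ", normalize-space(@class), " "), " schedTbl ")]/tbody/tr'
);

$games = array();
foreach ($rows as $row) {
    $date = $xpath->query('td[1]/div[1]', $row)->item(0);
    $away = $xpath->query('td[2]', $row)->item(0);
    $home = $xpath->query('td[3]', $row)->item(0);
    if ($date === null || $away === null || $home === null) {
        continue; // row does not have the expected cells
    }
    $games[] = array(
        trim($date->textContent),
        trim($away->textContent),
        trim($home->textContent),
    );
}

print_r($games);
```

Because the XPath expression names tbody explicitly, the header rows in both tables are never visited.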

Papa De Beau