How to make a small php link "spider" and extract data?

Question

I want to spider a simple white website that has lots of HTML links, each of which represents a phone number, name and address. From each page I want to extract exactly the 3 fields that sit between the 3 TDs, such as:

    <div id="idTabResults2" align="center">
        <TABLE border='1'>
    <tr><th>Name</th><th>Adress</th><th>Phone number</th></tr>
    <TR>
          <TD>Joe</TD><TD>New York</TD><TD>555999</TD></TR>
    </TABLE>

    </div>

So in the example above I would get "Joe", "New York" and 555999. I'm using PHP, and later MySQL to insert every result into my DB. Can someone point me in the right direction on how to go about this?

You probably want an HTML parser, not a regex – fge Dec 25 '11 at 23:52

score 1 · Answer 1 · answered Dec 25 '11 at 23:52

You can retrieve the page content using cURL.

Once you have the content you can parse it with PHP's DOM.

Do not try to parse it using regex. God will kill a kitten just for that.
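
To make the two steps above concrete, here is a minimal sketch that uses cURL for the download and PHP's built-in `DOMDocument` and `DOMXPath` for the parsing. The function names `fetchPage` and `extractRows` are hypothetical, and the XPath expression assumes every result page uses the same `idTabResults2` div as the sample in the question. The usage part parses the sample markup directly, so it runs without network access.

```php
<?php
// Hypothetical helper: download a page with cURL and return its body, or null on failure.
function fetchPage(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // give up after 10 seconds
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}

// Hypothetical helper: return one [name, address, phone] array per data row.
function extractRows(string $html): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally; real-world HTML is rarely valid.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $rows = [];
    // Only rows that contain <td> cells, which skips the <th> header row.
    foreach ($xpath->query('//div[@id="idTabResults2"]//tr[td]') as $tr) {
        $cells = [];
        foreach ($xpath->query('td', $tr) as $td) {
            $cells[] = trim($td->textContent);
        }
        if (count($cells) === 3) {
            $rows[] = $cells;
        }
    }
    return $rows;
}

// Sample markup taken from the question.
$sample = <<<HTML
<div id="idTabResults2" align="center">
<TABLE border='1'>
<tr><th>Name</th><th>Adress</th><th>Phone number</th></tr>
<TR><TD>Joe</TD><TD>New York</TD><TD>555999</TD></TR>
</TABLE>
</div>
HTML;

print_r(extractRows($sample));
```

For a live page, the result of `fetchPage($url)` would be passed to `extractRows()` after checking that it is not `null`.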

answered Dec 25 '11 at 23:52

PeeHaa

So how can I parse using the DOM? I'm really new to that one. – Tom Dec 25 '11 at 23:54

score 1 · Accepted Answer · answered Dec 25 '11 at 23:56

Maybe a faster (and simpler) way than PeeHaa's solution:

1. Retrieve the page using file_get_contents().
2. Parse it with Simple HTML DOM Parser.

For instance:

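    <?php
    require 'simple_html_dom.php';

    // YOUR_PAGE_HERE is a placeholder for the URL of the page to fetch
    $data = file_get_contents(YOUR_PAGE_HERE);
    $html = str_get_html($data);

    // Collect every <td> element on the page
    $tds = $html->find('td');

    foreach ($tds as $td) {
      // Do something, e.g. read the cell text with $td->plaintext
    }
    ?>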
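
Because `find('td')` returns the cells as one flat list, the three fields of each row still have to be regrouped. Here is a minimal sketch, assuming every data row has exactly three cells in the order name, address, phone. The helper name `groupCells` is hypothetical, and the `$cellTexts` array stands in for the values you would collect from `$td->plaintext` inside the loop.

```php
<?php
// Hypothetical helper: turn a flat list of cell texts into name/address/phone records.
function groupCells(array $cellTexts): array
{
    $records = [];
    foreach (array_chunk($cellTexts, 3) as $chunk) {
        if (count($chunk) !== 3) {
            continue; // ignore a trailing incomplete row
        }
        $records[] = [
            'name'    => trim($chunk[0]),
            'address' => trim($chunk[1]),
            'phone'   => trim($chunk[2]),
        ];
    }
    return $records;
}

// Stand-in for: foreach ($tds as $td) { $cellTexts[] = $td->plaintext; }
$cellTexts = ['Joe', 'New York', '555999'];
print_r(groupCells($cellTexts));
```

Each record is then ready to be bound to a prepared INSERT statement for the database step.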

answered Dec 25 '11 at 23:56

ldiqual

And to iterate over each link? – Tom Dec 25 '11 at 23:58
Yes, exactly. Just make your loop begin after `require` and end after the `foreach` curly bracket. – ldiqual Dec 26 '11 at 00:01
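
To iterate over every link, the spider first needs the list of links from the index page. Here is a minimal sketch of that step using PHP's built-in `DOMDocument`. The helper name `extractLinks` and the sample markup are made up for illustration, and the sketch assumes the `href` values can be fetched as they are, so relative URLs would still need the site's base URL prepended.

```php
<?php
// Hypothetical helper: collect the distinct href values of all <a> elements on a page.
function extractLinks(string $html): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally; real-world HTML is rarely valid.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = trim($a->getAttribute('href'));
        if ($href !== '') {
            $links[] = $href;
        }
    }
    // Drop repeated links so each page is visited only once.
    return array_values(array_unique($links));
}

// Made-up index page with one repeated link.
$index = '<html><body>'
    . '<a href="/people/1.html">Joe</a> '
    . '<a href="/people/2.html">Ann</a> '
    . '<a href="/people/1.html">Joe again</a>'
    . '</body></html>';

foreach (extractLinks($index) as $link) {
    // In a real spider: fetch $link here, then run the <td> extraction on the result.
    echo $link, "\n";
}
```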
