0

I want to spider a simple white website that has lots of HTML links, each representing a name, address and phone number. From each page I want to extract the exact three fields that sit inside the three TDs, such as:

    <div id="idTabResults2" align="center">
        <TABLE border='1'>
    <tr><th>Name</th><th>Adress</th><th>Phone number</th></tr>
    <TR>
          <TD>Joe</TD><TD>New York</TD><TD>555999</TD></TR>
    </TABLE>

    </div>

So in the example above I would get "Joe", "New York" and 555999. I'm using PHP, and later MySQL, to insert every result into my DB. Can someone point me in the right direction on how to go about this?

PeeHaa
Tom

2 Answers

1

You can retrieve the page content using cURL.

Once you have the content you can parse it with PHP's DOM.

Do not try to parse it using regex. God will kill a kitten just for that.
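A minimal sketch of the cURL plus DOM approach, assuming each page contains the div with id idTabResults2 shown in the question. The function names fetchPage and extractRows are hypothetical helpers introduced only for illustration, and the sample markup is copied from the question so the parsing part runs without a network connection.

```php
<?php
// Hypothetical helper: download a page with cURL; returns the HTML or null on failure.
function fetchPage(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    $body = curl_exec($ch);
    return is_string($body) ? $body : null;
}

// Hypothetical helper: return one [name, address, phone] array per row that has three TDs.
function extractRows(string $html): array
{
    libxml_use_internal_errors(true); // scraped HTML is rarely valid; collect warnings quietly
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $rows = [];
    // tr[td] skips the header row, which only contains TH cells.
    foreach ($xpath->query('//div[@id="idTabResults2"]//tr[td]') as $tr) {
        $cells = [];
        foreach ($xpath->query('td', $tr) as $td) {
            $cells[] = trim($td->textContent);
        }
        if (count($cells) === 3) {
            $rows[] = $cells;
        }
    }
    return $rows;
}

// Sample markup taken from the question; a real run would use fetchPage($url) instead.
$sample = <<<HTML
<div id="idTabResults2" align="center">
<TABLE border='1'>
<tr><th>Name</th><th>Adress</th><th>Phone number</th></tr>
<TR><TD>Joe</TD><TD>New York</TD><TD>555999</TD></TR>
</TABLE>
</div>
HTML;

$rows = extractRows($sample);
print_r($rows);
```

Each entry in $rows can then be passed to a prepared INSERT statement for the MySQL step.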

PeeHaa
1

Maybe a faster (and simpler) way than PeeHaa's solution is the Simple HTML DOM library (simple_html_dom.php).

For instance:

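    <?php
    require 'simple_html_dom.php';

    // YOUR_PAGE_HERE is a placeholder for the URL of the page to scrape.
    $data = file_get_contents(YOUR_PAGE_HERE);
    if ($data === false) {
        die('Could not download the page');
    }

    // Build a Simple HTML DOM object and collect every TD element.
    $html = str_get_html($data);
    $tds = $html->find('td');

    foreach ($tds as $td) {
        // Do something, e.g. read the cell text from $td->plaintext
    }
    ?>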
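One possible way to fill in the loop body is to collect the text of every TD into a flat list and then group it into records of three. The sketch below assumes the cells always appear in name, address, phone order. The $cellTexts values are hypothetical stand-ins for what the loop would collect; only the first row comes from the question.

```php
<?php
// Hypothetical input: the text of every TD on the page, in document order.
$cellTexts = ['Joe', 'New York', '555999', 'Ann', 'Boston', '555111'];

// Split the flat list into records of three fields each.
$records = [];
foreach (array_chunk($cellTexts, 3) as $chunk) {
    if (count($chunk) === 3) { // skip a trailing, incomplete row
        [$name, $address, $phone] = $chunk;
        $records[] = compact('name', 'address', 'phone');
    }
}

print_r($records);
```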
ldiqual